I want to map over a Cassandra table in Spark but my code that executes
needs a shutdown() call to return any threads, release file handles, etc.
Will Spark always execute my mappers as a forked process? And if so, how do
I handle threads preventing the JVM from terminating?
It would be nice if
Hello Jose,
We hit the same issue a couple of months ago. It is possible to write
directly to files instead of creating directories, but it is not
straightforward, and I haven't seen any clear demonstration in books,
tutorials, etc.
We do something like:
SparkConf sparkConf = new
Hello Experts,
I am running a Spark Streaming app inside YARN. I also have the Spark History
Server running (do we need it running to access the UI?).
The app is running fine as expected, but Spark's web UI is not
accessible.
When I try to access the ApplicationMaster of the Yarn application I
The error says Cannot assign requested address. This means that you need to
use the correct address for one of your network interfaces, or 0.0.0.0 to
accept connections from all interfaces. Can you paste your spark-env.sh
file and /etc/hosts file?
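For example, a minimal spark-env.sh sketch following that advice (the 0.0.0.0 binding is only an illustration; use your actual interface address if you don't want to listen on all interfaces):
# spark-env.sh (sketch): bind to one interface, or to 0.0.0.0 for all interfaces
SPARK_LOCAL_IP=0.0.0.0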
Thanks
Best Regards
On Wed, Feb 18, 2015 at 2:06
Check for the dependencies. Looks like you have a conflict around servlet-api
jars.
Maven's dependency:tree, some exclusions, and some luck :) could help.
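For example, mvn dependency:tree shows where the duplicate comes from, and an exclusion sketch (the artifact names here are only illustrative, not necessarily the culprit in this build) could look like:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>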
From: Ralph Bergmann | the4thFloor.eu [ra...@the4thfloor.eu]
Sent: Tuesday, February 17, 2015 4:14 PM
Hello Imran,
(a) I know that all 20 files are processed when I use foreachRDD, because I
can see the processed files in the output directory. (My application logic
writes them to an output directory after they are processed, *but* that
writing operation does not happen in foreachRDD, below you
On Wed, Feb 18, 2015 at 10:23 AM, bit1...@163.com bit1...@163.com wrote:
Sure, thanks Akhil.
A further question: Is the local file system (file:///) not supported in a
standalone cluster?
FYI: I'm able to write to local file system (via HDFS API and using
file:/// notation) when using Spark.
--
Thanks to everyone for suggestions and explanations.
Currently I've started to experiment with the following scenario, that
seems to work for me:
- Put the properties file on a web server so that it is centrally available
- Pass it to the Spark driver program via --conf 'propertiesFile=http:
I'm trying to access the Spark UI for an application running through YARN.
Clicking on the Application Master under Tracking UI I get an HTTP ERROR
500:
HTTP ERROR 500
Problem accessing /proxy/application_1423151769242_0088/. Reason:
Connection refused
Caused by:
Can you track your comments on the existing issue?
https://issues.apache.org/jira/browse/SPARK-5837
I personally can't reproduce this but more info would help narrow it down.
On Wed, Feb 18, 2015 at 10:58 AM, rok rokros...@gmail.com wrote:
I'm trying to access the Spark UI for an application
Hi,
I'm writing a Spark program where I want to divide an RDD into different
groups, but the groups are too big to use groupByKey. To cope with that,
since I know in advance the list of keys for each group, I build a map from
the keys to the RDDs that result from filtering the input RDD to get the
Hi
We're using Spark in our app's unit tests. The tests start a Spark
context with local[*], and the test time is now 178 seconds on Spark 1.2
instead of 41 seconds on 1.0.
We are using the Spark version from Cloudera CDH (1.2.0-cdh5.3.1).
Could you give some hints about what could cause that? And where to
Hello,
Could you please add Big Industries to the Powered by Spark page at
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ?
Company Name: Big Industries
URL: http://www.bigindustries.be/
Spark Components: Spark Streaming
Use Case: Big Content Platform
Summary:
But I am able to run the SparkPi example:
./run-example SparkPi 1000 --master spark://192.168.26.131:7077
Result:Pi is roughly 3.14173708
bit1...@163.com
From: bit1...@163.com
Date: 2015-02-18 16:29
To: user
Subject: Problem with 1 master + 2 slaves cluster
Hi sparkers,
I setup a
When you give file://, reading requires that all slaves have that
path/file available locally on their filesystem. It's OK to give file:// when
you run your application in local mode (like master=local[*]).
Thanks
Best Regards
On Wed, Feb 18, 2015 at 2:58 PM, Emre Sevinc
It seems like it's not able to get a port it needs. Are you sure that
the required port is available? In which logs did you find this error?
On Wed, Feb 18, 2015 at 2:21 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
The error says Cannot assign requested address. This means that you need
To clarify, sometimes in the world of Hadoop people freely refer to an
output 'file' when it's really a directory containing 'part-*' files which
are pieces of the file. It's imprecise but that's the meaning. I think the
scaladoc may be referring to 'the path to the file, which includes this
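A minimal sketch of that behaviour (the paths are hypothetical):
rdd.saveAsTextFile("hdfs:///tmp/output")      // writes /tmp/output/part-00000, part-00001, ...
val back = sc.textFile("hdfs:///tmp/output")  // reading the directory path picks up all the parts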
Thanks Francois for the comment and useful link. I understand the problem
better now.
best,
/Shahab
On Wed, Feb 18, 2015 at 10:36 AM, francois.garil...@typesafe.com wrote:
In a nutshell : because it’s moving all of your data, compared to other
operations (e.g. reduce) that summarize it in
Hi
What is the general consensus/roadmap for implementing additional online /
streamed trainable models?
Apache Spark 1.2.1 currently supports streaming linear regression and
clustering, although other streaming linear methods are planned according to
the issue tracker.
However, I can not find any
Since the cluster is standalone, you are better off reading/writing to hdfs
instead of local filesystem.
Thanks
Best Regards
On Wed, Feb 18, 2015 at 2:32 PM, bit1...@163.com bit1...@163.com wrote:
But I am able to run the SparkPi example:
./run-example SparkPi 1000 --master
Hi Sasi,
Forgot to mention that the job server uses the Typesafe Config library. The input
is JSON; you can find the syntax at the link below:
https://github.com/typesafehub/config
Regards,
Vasu C
Forked, meaning, different from the driver? Spark will in general not
even execute your tasks on the same machine as your driver. The driver can
choose to execute a task locally in some cases.
You are creating non-daemon threads in your function? Your function can and
should clean up after
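A minimal sketch of that per-partition cleanup, assuming a hypothetical Client whose shutdown() stops its non-daemon threads:
val results = rdd.mapPartitions { records =>
  val client = new Client()                      // created on the executor, once per partition
  try {
    records.map(client.process).toList.iterator  // materialize results before shutdown() runs
  } finally {
    client.shutdown()                            // release threads/handles so the executor JVM can exit
  }
}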
In a nutshell : because it’s moving all of your data, compared to other
operations (e.g. reduce) that summarize it in one form or another before moving
it.
For the longer answer:
Sure, thanks Akhil.
A further question: Is the local file system (file:///) not supported in a standalone
cluster?
bit1...@163.com
From: Akhil Das
Date: 2015-02-18 17:35
To: bit1...@163.com
CC: user
Subject: Re: Problem with 1 master + 2 slaves cluster
Since the cluster is standalone, you are
Hi,
Based on what I could see in the Spark UI, I noticed that groupBy
transformation is quite slow (taking a lot of time) compared to other
operations.
Is there any reason that groupBy is slow?
shahab
Hi,
I want to run my Spark job in Hadoop YARN cluster mode.
I am using the command below:
spark-submit --master yarn-cluster --driver-memory 1g --executor-memory 1g
--executor-cores 1 --class com.dc.analysis.jobs.AggregationJob
sparkanalitic.jar param1 param2 param3
I am getting the error below:
Hi all,
I am trying to create RDDs from within rdd.foreachPartition() so I can
save these RDDs to ElasticSearch on the fly:
stream.foreachRDD(rdd => {
  rdd.foreachPartition {
    iterator => {
      val sc = rdd.context
      iterator.foreach {
        case (cid,
hi, I have a job that fails on a shuffle during a sortByKey, on a
relatively small dataset. http://pastebin.com/raw.php?i=1LxiG4rY
Thanks, Cody. Yes, I originally started off by looking at that but I get a
compile error if I try and use that approach: constructor JdbcRDD in class
JdbcRDD<T> cannot be applied to given types. Not to mention that
JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
argument).
Would you mind explaining your problem a little more specifically, e.g. the
exceptions you hit, so someone who has experience with it could
give advice?
Thanks
Jerry
2015-02-19 1:08 GMT+08:00 athing goingon athinggoin...@gmail.com:
hi, I have a job that fails on a shuffle during a
Take a look at
https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java
On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg dgoldenberg...@gmail.com
wrote:
I'm reading data from a database using JdbcRDD, in Java, and I have an
implementation of
Is sc there a SparkContext or a JavaSparkContext? The compilation error
seems to indicate the former, but JdbcRDD.create expects the latter
On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I have tried that as well, I get a compile error --
[ERROR]
does anyone have the right maven invocation for cdh5 with yarn?
i tried:
$ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests clean
package
$ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn test
it builds and passes tests just fine, but when i deploy on cluster and i
try to
Hello,
On Tue, Feb 17, 2015 at 8:53 PM, dgoldenberg dgoldenberg...@gmail.com wrote:
I've tried setting spark.files.userClassPathFirst to true in SparkConf in my
program, also setting it to true in $SPARK_HOME/conf/spark-defaults.conf as
Is the code in question running on the driver or in some
thanks! my bad
On Wed, Feb 18, 2015 at 2:00 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Koert,
You should be using -Phadoop-2.3 instead of -Phadoop2.3.
-Sandy
On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote:
does anyone have the right maven invocation for
Ashutosh,
Were you able to figure this out? I am having the exact same question.
I think the answer is to use Spark SQL to create/load a table in Hive (e.g.
execute the HiveQL CREATE TABLE statement) but I am not sure. Hoping for
something more simple than that.
Anybody?
Thanks!
Hi Sean,
Thanks a lot for your answer. That explains it, as I was creating thousands
of RDDs, so I guess the communication overhead was the reason why the Spark
job was freezing. After changing the code to use RDDs of pairs and
aggregateByKey it works just fine, and quite fast.
Again, thanks a
I'm fairly new to Spark.
We have data in avro files on hdfs.
We are trying to load up all the avro files (28 gigs worth right now) and do
an aggregation.
When we have less than 200 tasks the data all runs and produces the proper
results. If there are more than 200 tasks (as stated in the logs by
That test I linked
https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90
is calling a static method JdbcRDD.create, not new JdbcRDD. Is that what
you tried doing?
On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com
I have tried that as well, I get a compile error --
[ERROR] ...SparkProto.java:[105,39] error: no suitable method found for
create(SparkContext,<anonymous ConnectionFactory>,String,int,int,int,
<anonymous Function<ResultSet,Integer>>)
The code is a copy and paste:
JavaRDD<Integer> jdbcRDD =
Hey,
It seems pretty clear that one of the strengths of Spark is being able to
share your code between your batch and streaming layers. Though, given that
Spark Streaming uses a DStream (a sequence of RDDs) while batch Spark uses a
single RDD, there might be some complexity associated with it.
Of course since
I find monoids pretty useful in this respect, basically separating out the
logic in a monoid and then applying the logic to either a stream or a
batch. A list of such practices could be really useful.
On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Hey,
It
If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it.
http://www.meetup.com/Hamburg-Apache-Spark-Meetup/
On Tue, Feb 17, 2015 at 5:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Thanks! I've added you.
Matei
On Feb 17, 2015, at 4:06 PM, Ralph Bergmann |
Hi Koert,
You should be using -Phadoop-2.3 instead of -Phadoop2.3.
-Sandy
On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote:
does anyone have the right maven invocation for cdh5 with yarn?
i tried:
$ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests
I am working right now with the ML pipeline, which I really like.
However, in order to make real use of it, I would like to create my own
transformers that implement org.apache.spark.ml.Transformer. In order to
do that, a method from PipelineStage needs to be implemented. But this
method is
I'm using Solrj in a Spark program. When I try to send the docs to Solr, I
get the NotSerializableException on the DefaultHttpClient. Is there a
possible fix or workaround?
I'm using Spark 1.2.1 with Hadoop 2.4, SolrJ is version 4.0.0.
final HttpSolrServer solrServer = new
Concurrent inserts into the same table are not supported. I can try to
make this clearer in the documentation.
On Tue, Feb 17, 2015 at 8:01 PM, Vasu C vasuc.bigd...@gmail.com wrote:
Hi,
I am running spark batch processing job using spark-submit command. And
below is my code snippet.
I'm not sure what "on the driver" means, but I've tried
setting spark.files.userClassPathFirst to true,
in $SPARK_HOME/conf/spark-defaults.conf and also in the SparkConf
programmatically; it appears to be ignored. The solution was to follow
Emre's recommendation and downgrade the selected Solrj
Cody, you were right, I had a copy and paste snag where I ended up with a
vanilla SparkContext rather than a Java one. I also had to *not* use my
function subclasses, rather just use anonymous inner classes for the
Function stuff and that got things working. I'm fully following
the JdbcRDD.create
Hi,
This is a duplicate of the stack-overflow question here
http://stackoverflow.com/questions/28569374/spark-returning-pickle-error-cannot-lookup-attribute.
I hope to generate more interest on this mailing list.
*The problem:*
I am running into some attribute lookup problems when trying to
Hi,
I created some Hive tables and am trying to list them through Beeline, but I am
not getting any results. I can list the tables through spark-sql.
When I connect beeline, it starts up with following message :
Connecting to jdbc:hive2://tst001:10001
Enter username for jdbc:hive2://tst001:10001:
Can't you implement the
org.apache.spark.api.java.function.Function
interface and pass an instance of that to JdbcRDD.create?
On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Cody, you were right, I had a copy and paste snag where I ended up with a
vanilla
Hi,
Is it possible to configure Spark to run without admin permission on Windows?
My current setup runs master and slave successfully with admin permission.
However, if I downgrade the permission level from admin to user, SparkPi fails with
the following exception on the slave node:
Exception in thread
Monoids are useful in aggregations. Also, try to avoid anonymous functions;
factoring functions out of the Spark code allows the functions to be
reused (possibly between Spark and Spark Streaming).
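A small sketch of that idea (the names and the dstream variable are hypothetical):
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
object WordCountLogic {
  def countWords(lines: RDD[String]): RDD[(String, Long)] =
    lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)
}
val batchCounts  = WordCountLogic.countWords(sc.textFile("hdfs:///data/input"))  // batch
val streamCounts = dstream.transform(WordCountLogic.countWords _)                // per micro-batch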
On Thu, Feb 19, 2015 at 6:56 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Thanks Arush. I will
You don't need admin permission; just make sure all those jars
have execute permission (read/write access).
Thanks
Best Regards
On Thu, Feb 19, 2015 at 11:30 AM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Hi,
Is it possible to configure spark to run without admin
Thanks Imran, I'll try your suggestions.
I eventually got this to run by 'checkpointing' the joined RDD (according
to Akhil's suggestion) before performing the reduceBy, and then
checkpointing it again afterward. i.e.
val rdd2 = rdd.join(rdd, numPartitions=1000)
.map(fp => ((fp._2._1, fp._2._2),
This feature request is already being tracked:
https://issues.apache.org/jira/browse/SPARK-4981
Aiming for 1.4
Best,
Reza
On Wed, Feb 18, 2015 at 2:40 AM, mucaho muc...@yahoo.com wrote:
Hi
What is the general consensus/roadmap for implementing additional online /
streamed trainable models?
Thank you very much Vasu. Let me add some more points to my question. We are
developing a Java program for connecting spark-jobserver to Vaadin (Java
framework). Following is the sample code I wrote for connecting both (the
code works fine):
URL url = null;
HttpURLConnection connection = null;
Hi,
I have dependency problems using spark-core inside an HttpServlet
(see my other mail).
Maybe I'm wrong?!
What I want to do: I am developing a mobile app (Android and iOS) and want to
connect it to Spark on the backend side.
To do this I want to use Tomcat. The app uses https to ask
I am implementing a stream learner for text classification. There are some
single-valued parameters in my implementation that need to be updated as
new stream items arrive. For example, I want to change the learning rate as the
new predictions are made. However, I doubt that there is a way to
At some level, enough RDDs create hundreds of thousands of tiny
partitions of data, each of which creates a task for each stage. The
raw overhead of all the message passing can slow things down a lot. I
would not design something to use an RDD per key. You would generally
key by some value you
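A short sketch of that keyed design, computing per-key statistics with aggregateByKey instead of building one RDD per key (pairs is a hypothetical RDD[(String, Double)]):
import org.apache.spark.SparkContext._
import org.apache.spark.util.StatCounter
val statsPerKey = pairs.aggregateByKey(new StatCounter())(
  (acc, value) => acc.merge(value),  // fold one value into the partition-local accumulator
  (a, b) => a.merge(b)               // combine accumulators from different partitions
)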
Are you proposing I downgrade Solrj's httpclient dependency to be on par with
that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest?
Solrj has to stay with its selected version. I could try and rebuild Spark with
the latest httpclient but I've no idea what effects that may
Hi
Did you try these steps?
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Thanks
Arush
On Wed, Feb 18, 2015 at 7:20 PM, sandeepvura sandeepv...@gmail.com wrote:
Hi ,
I am new to Spark. I have installed Spark on a 3-node cluster. I would like to
integrate
You can't use RDDs inside RDDs. RDDs are managed from the driver, and
functions like foreachRDD execute things on the remote executors.
You can write code to simply save whatever you want to ES directly.
There is not necessarily a need to use RDDs for that.
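A hedged sketch of saving directly, assuming the elasticsearch-hadoop connector (saveToEs) is on the classpath; the index/type name is made up:
import org.elasticsearch.spark._
stream.foreachRDD { rdd =>
  // stay on the driver-side RDD API; no RDDs are created inside executor code
  rdd.map { case (cid, value) => Map("cid" -> cid, "value" -> value) }
     .saveToEs("mydocs/mytype")
}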
On Wed, Feb 18, 2015 at 11:36 AM, t1ny
I am running Spark jobs behind Tomcat. We didn't face any issues. But for us
the user base is very small.
The possible blockers could be:
1. If there are many users of the system, then jobs might have to wait; you
might want to think about the kind of scheduling you want to do.
2. Again, if the number of
Hi Paweł,
Thanks a lot for your answer. I finally got the program to work by using
aggregateByKey, but I was wondering why creating thousands of RDDs doesn't
work. I think that could be interesting for using methods that work on RDDs
like for example JavaDoubleRDD.stats() (
Hello Dmitry,
I had almost the same problem and solved it by using version 4.0.0 of SolrJ:
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>4.0.0</version>
</dependency>
In my case, I was lucky that version 4.0.0 of SolrJ had all the
functionality
Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would
that not collide with Spark/Hadoop's default dependency on HttpClient set
to 4.2.6? If that's the case that might just solve the problem.
Would Solrj 4.0.0 work with the latest Solr, 4.10.3?
On Wed, Feb 18, 2015 at 10:50
Thanks Sean, but I don't think that fitting into memory is the case,
because:
1- I can see in the UI that 100% of the RDD is cached (moreover the RDD is
quite small, 100 MB, while the worker has 1.5 GB)
2- I also tried MEMORY_AND_DISK, but absolutely no difference!
Probably I have messed up somewhere
On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would
that not collide with Spark/Hadoop's default dependency on HttpClient set
to 4.2.6? If that's the case that might just solve the problem.
Thanks, Emre! Will definitely try this.
On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote:
On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would
that not collide
This does not sound like a Spark problem -- doesn't even necessarily
sound like a distributed problem. Are you of a scale where building
simple logic in a web tier that queries a NoSQL / SQL database doesn't
work?
If you are at such a scale, then it sounds like you're describing a
very high
Thanks for all the responses so far! I have started to understand the
system more, but I just had another question while I was going along. Is
there a way to check the individual partitions of an RDD? For example, if I
had a graph with vertices a,b,c,d and it was split into 2 partitions could
I
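One way to peek at partition contents is sketched below (suitable only for small RDDs, since collect() pulls everything back to the driver):
val vertices = sc.parallelize(Seq("a", "b", "c", "d"), 2)
vertices.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")  // shows which elements landed in each partition
}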
I think I'm going to have to rebuild Spark with commons.httpclient.version
set to 4.3.1 which looks to be the version chosen by Solrj, rather than the
4.2.6 that Spark's pom mentions. Might work.
On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
Hi
Did you
Sorry for the noise. I have found it.
At 2015-02-18 23:34:40, Todd bit1...@163.com wrote:
It looks like the log analysis reference app provided by Databricks at
https://github.com/databricks/reference-apps only has a Java API?
I'd like to see a Scala version.
It looks like the log analysis reference app provided by Databricks at
https://github.com/databricks/reference-apps only has a Java API?
I'd like to see a Scala version.
Hi ,
I am new to Spark. I have installed Spark on a 3-node cluster. I would like to
integrate Hive on Spark.
Can anyone please help me with this?
Regards,
Sandeep.v
Maybe you can omit grouping with groupByKey altogether? What is
your next step after grouping elements by key? Are you trying to reduce
values? If so, then I would recommend using reducing functions like, for
example, reduceByKey or aggregateByKey. Those will first reduce the value for
each
Although you can do lots of things, I don't think Spark is something
you should think of as a synchronous, real-time query API. So, somehow
trying to use it directly from a REST API is probably not the best
architecture.
That said, it depends a lot on what you are trying to do. What are you
so if you only change this line:
https://gist.github.com/emres/0fb6de128baea099e741#file-mymoduledriver-java-L137
to
json.print()
it processes 16 files instead? I am totally perplexed. My only
suggestions to help debug are
(1) see what happens when you get rid of MyModuleWorker completely --
The most likely explanation is that you wanted to put all the
partitions in memory and they don't all fit. Unless you asked to
persist to memory and disk, some partitions will simply not be cached.
Consider using MEMORY_AND_DISK persistence.
This can also happen if blocks were lost due to node
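A one-line sketch of asking for that storage level explicitly:
import org.apache.spark.storage.StorageLevel
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that don't fit in memory spill to disk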
Hi,
On 18.02.15 at 15:58, Sean Owen wrote:
That said, it depends a lot on what you are trying to do. What are you
trying to do? You just say you're connecting to spark.
There are 2 tasks I want to solve with Spark.
1) The user opens the mobile app. The app sends a ping to the backend.
When
I am still debugging it, but I believe that if m% of users have unusually large
columns and the RDD partitioner on RowMatrix is a HashPartitioner, then due to
the basic algorithm without sampling, some partitions can end up with an unusually
large number of keys...
If my debugging shows that, I will add a custom
Hi Tom,
there are a couple of things you can do here to make this more efficient.
First, I think you can replace your self-join with a groupByKey. On your
example data set, this would give you
(1, Iterable(2,3))
(4, Iterable(3))
this reduces the amount of data that needs to be shuffled, and
Currently, PySpark cannot pickle a class object defined in the current
script ('__main__'). The workaround is to put the implementation
of the class into a separate module, then use bin/spark-submit
--py-files xxx.py to deploy it.
in xxx.py:
class test(object):
    def __init__(self, a, b):
That's exactly what I was doing. However, I ran into runtime issues with
doing that. For instance, I had a
public class DbConnection extends AbstractFunction0<Connection>
implements Serializable
I got a runtime error from Spark complaining that DbConnection wasn't an
instance of scala.Function0.
Hi Cesar,
Thanks for trying out Pipelines and bringing up this issue! It's been an
experimental API, but feedback like this will help us prepare it for
becoming non-Experimental. I've made a JIRA, and will vote for this being
protected (instead of private[ml]) for Spark 1.3:
No suitable driver found error when creating a table in Hive from Spark SQL.
I am trying to execute the following example.
Spark git:
spark/examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala
My setup: hadoop 1.6, spark 1.2, hive 1.0, mysql server (installed via yum
install
Hi,
Thanks for your reply.
I basically want to check whether my understanding of what parallelize() does on
RDDs is correct. In my case, I create a vertex RDD and an edge RDD and distribute
them by calling parallelize(). Now, does Spark perform any operations on these
RDDs in parallel?
For example, if I apply
+1 for writing the Spark output to Kafka. You can then hang multiple
compute/storage frameworks off Kafka. I am using a similar pipeline to feed
ElasticSearch and HDFS in parallel. Allows modularity, you can take down
ElasticSearch or HDFS for maintenance without losing (except for some edge
Thanks Arush. I will check that out.
On Wed, Feb 18, 2015 at 11:06 AM, Arush Kharbanda
ar...@sigmoidanalytics.com wrote:
I find monoids pretty useful in this respect, basically separating out the
logic in a monoid and then applying the logic to either a stream or a
batch. A list of such
Found a solution in one of the posts found on the internet.
I updated spark/bin/compute-classpath.sh and added the database connector jar
to the classpath.
CLASSPATH=$CLASSPATH:/data/mysql-connector-java-5.1.14-bin.jar
Thanks for the advice folks, it is much appreciated. This seems like a pretty
unfortunate design flaw. My team was surprised by it.
I’m going to drop the two-step process and do it all in a single step until we
get Kafka online.
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday,
This is a *fantastic* question. The idea of how we identify individual
things in multiple DStreams is worth looking at.
The reason being, that you can then fine tune your streaming job, based on
the RDD identifiers (i.e. are the timestamps from the producer correlating
closely to the order in
You need to instantiate the server in the foreachPartition block or Spark will
attempt to serialize it to the task. See the design patterns section in the
Spark Streaming guide.
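A minimal sketch of that pattern for the SolrJ case discussed above (the Solr URL is hypothetical, and docs are assumed to be SolrInputDocuments):
import org.apache.solr.client.solrj.impl.HttpSolrServer
stream.foreachRDD { rdd =>
  rdd.foreachPartition { docs =>
    val solr = new HttpSolrServer("http://localhost:8983/solr/collection1")  // built on the executor, never serialized
    try {
      docs.foreach(doc => solr.add(doc))
      solr.commit()
    } finally {
      solr.shutdown()
    }
  }
}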
Jose Fernandez | Principal Software Developer
jfernan...@sdl.com |
There does not seem to be a definitive answer on this. Every time I google
for message ordering,the only relevant thing that comes up is this -
http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
.
With a kafka receiver that pulls data from a single kafka partition