Thanks Imran, but I do appreciate if you explain what this mean and what
are the reasons make it happening. I do need it.
If there is any documentation somewhere you can simply direct me there so
I can try to understand it myself.
best,
/Shahab
On Sat, Feb 21, 2015 at 12:26 AM, Imran Rashid
Hi,
This has already been briefly discussed here in the past but there seem to be
more questions...
I am running a bigger ALS task with ~40 GB of input data (~3 billion ratings).
The data is partitioned into 512 partitions and I am also using the default
parallelism set to 512. The ALS runs with
No, unfortunately we're not making use of dynamic allocation or the
external shuffle service. Hoping that we could reconfigure our cluster to
make use of it, but since it requires changes to the cluster itself (and
not just the Spark app), it could take some time.
Unsure if task 450 was acting as
Hi,
If I repartition my data by a factor equal to the number of worker
instances, will the performance be better or worse?
As far as I understand, the performance should be better, but in my case it
is becoming worse.
I have a single-node standalone cluster; could that be the reason?
Am I guaranteed
I am in the same situation as this person; has anyone faced and fixed this
error?
http://stackoverflow.com/questions/24610462/why-does-spark-shell-master-yarn-client-fail-yet-pyspark-master-yarn-seems
Thanks,
Quang.
Makes complete sense - I became a fan of Spark for pretty much the same
reasons. Best of luck, eh?!
On Mon Feb 23 2015 at 12:08:49 AM Francisco Orchard forch...@gmail.com
wrote:
Hi Denny Ashic,
You are pointing us in the right direction. Thanks!
We will try following your advice and
Hi Joseph,
Thank you for your feedback. I've managed to define an image type by
following the VectorUDT implementation.
I have another question about the definition of a user-defined transformer.
The UnaryTransformer is private to spark.ml. Do you plan
to provide a developer API for transformers?
On
Hi Marius,
Are you using the sort or hash shuffle?
Also, do you have the external shuffle service enabled (so that the Worker
JVM or NodeManager can still serve the map spill files after an Executor
crashes)?
How many partitions are in your RDDs before and after the problematic
shuffle
Hi Sameer,
I’m still using Spark 1.1.1, I think the default is hash shuffle. No external
shuffle service.
We are processing gzipped JSON files; the number of partitions equals the number
of input files. In my current data set we have ~850 files that amount to 60 GB (so ~600
GB uncompressed). We have 5
Or, use the SparkOnHBase lab:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
From: Ted Yu yuzhih...@gmail.com
To: Akhil Das ak...@sigmoidanalytics.com
Cc: sandeep vura sandeepv...@gmail.com; user@spark.apache.org
user@spark.apache.org
Sent: Monday, February
Hello,
I am following the Movies recommendation with MLlib tutorial
(https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).
However, I get RMSE that are much larger than what's written at step 7:
The best model was trained with rank = 8 and lambda = 1.0, and numIter =
Thanks, Sean! Yes, I agree that this logging would still have some cost and
so would not be used in production.
On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen so...@cloudera.com wrote:
I think the cheapest possible way to force materialization is something
like
rdd.foreachPartition(i => None)
I
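A minimal sketch of that trick, assuming a SparkContext named sc (the input path is hypothetical):
val data = sc.textFile("hdfs:///some/input").map(_.length).cache()
data.foreachPartition(_ => ())  // cheap action: visits every partition, forcing it to be computed and cached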
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there
are slots for task threads. The slot count is configured by the num_cores
setting. Generally you can oversubscribe this, so if you have 10 free CPU cores,
set num_cores to 20.
On Monday, February 23, 2015, Deep Pradhan
Nabble is a third-party site. If you send stuff through Nabble, Nabble has
to forward it along to the Apache mailing list. If something goes wrong
with that, you will have a message show up on Nabble that no-one saw.
The reverse can also happen, where something actually goes out on the list
and
I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
but still, it should work…
Nope, it shouldn’t, unfortunately. The Spark base AMIs are custom-built for
spark-ec2. No other AMI will work unless it was built with that goal in
mind. Using a random AMI from the Amazon
Thanks, I'll read that paper. We haven't tried with a cluster that big,
but presumably we will in the future, and I was worried about it.
I'll report back once we finally do, but it's not going to be
tomorrow :)
2015-02-23 17:38 GMT+01:00 Mosharaf Chowdhury mosharafka...@gmail.com:
Hi
In general you should first figure out how many task slots are in the
cluster and then repartition the RDD to maybe 2x that number. So if you have
100 slots, then RDDs with a partition count of 100-300 would be normal.
But the size of each partition also matters. You want a task to operate on a
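A rough sketch of that sizing advice, assuming a SparkContext named sc (the input path is hypothetical):
val rawRdd = sc.textFile("hdfs:///some/input")
val slots = sc.defaultParallelism         // often equals the total number of task slots in the cluster
val tuned = rawRdd.repartition(2 * slots) // aim for roughly 2x the slot count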
Have you looked at Kylin?
http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/#.VOtXUUsqnUk
Pretty new but has the backing of eBay.
On 23 Feb 2015, at 15:38, Denny Lee denny.g@gmail.com wrote:
Makes complete sense - I became a fan of Spark for pretty
What do you mean?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Movie-Recommendation-tutorial-tp21769p21771.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
sortByKey() is the probably the easiest way:
import org.apache.spark.SparkContext._
joinedRdd.map { case (word, (file1Counts, file2Counts)) =>
  (file1Counts, (word, file1Counts, file2Counts)) }.sortByKey()
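A slightly fuller, self-contained sketch of the same idea (toy data from the question, sorted descending by the file1 count; assumes a SparkContext named sc):
import org.apache.spark.SparkContext._
val file1 = sc.parallelize(Seq(("aaa", 4), ("bbb", 6), ("ddd", 3)))
val file2 = sc.parallelize(Seq(("ddd", 2), ("bbb", 6), ("ttt", 5)))
file1.join(file2)                                       // (word, (file1Count, file2Count))
  .map { case (word, (c1, c2)) => (c1, (word, c1, c2)) }
  .sortByKey(ascending = false)                         // sort by file1 count, highest first
  .collect()
  .foreach(println)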
On Mon, Feb 23, 2015 at 10:41 AM, Anupama Joshi anupama.jo...@gmail.com
wrote:
Hi ,
To
The Spark cluster has no memory allocated.
Memory: 0.0 B Total, 0.0 B Used
From: Surendran Duraisamy 2013ht12...@wilp.bits-pilani.ac.in
To: user@spark.apache.org
Sent: Sunday, February 22, 2015 6:00 AM
Subject: Running Example Spark Program
Hello All,
I am new to Apache Spark,
Hi Guillermo,
The current broadcast algorithm in Spark approximates the one described in
the Section 5 of this paper
http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf.
It is expected to scale sub-linearly; i.e., O(log N), where N is the number
of machines in your cluster.
We
Hi guys,
I keep running into a strange problem where my jobs start to fail with the
dreaded "Resubmitted (resubmitted due to lost executor)" because of having too
many temp files from previous runs.
Both /var/run and /spill have enough disk space left, but after a given number
of jobs have
I'm looking for information on how broadcast variables scale in Spark and what
algorithm is used.
I have found
http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf
but I don't know if it describes the current version (1.2.1)
because the file was created in 2010.
I
I don't think current Spark Streaming supports window operations that go beyond
the available memory. Internally Spark Streaming keeps all the data belonging to
the effective window in memory; if the memory is not enough,
the BlockManager will discard blocks with an LRU policy, so something
Hi,
To simplify my problem -
I have 2 files from which I am reading words.
The output is like
file 1
aaa 4
bbb 6
ddd 3
file 2
ddd 2
bbb 6
ttt 5
If I do file1.join(file2)
I get (ddd,(3,2))
(bbb,(6,6))
If I want to sort the output by the number of occurrences of the word in file1,
how do I achieve
Installing hbase on hadoop cluster would allow hbase to utilize features
provided by hdfs, such as short circuit read (See '90.2. Leveraging local
data' under http://hbase.apache.org/book.html#perf.hdfs).
Cheers
On Sun, Feb 22, 2015 at 11:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
If
How is a task slot different from the number of Workers?
"so don't read into any performance metrics you've collected to
extrapolate what may happen at scale."
I did not understand what you meant by this.
Thank You
On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com
wrote:
In general you should first
Hi,
Our application is being designed to operate at all times on a large
sliding window (day+) of data. The operations performed on the window
of data will change fairly frequently and I need a way to save and
restore the sliding window after an app upgrade without having to wait
the duration of
I'm looking at my YARN container logs for some of the executors which appear
to be failing (with the missing shuffle files). I see exceptions that say
client.TransportClientFactory: Found inactive connection to host/ip:port,
closing it.
Right after that I see shuffle.RetryingBlockFetcher: Exception
Just to close the loop in case anyone runs into the same problem I had.
By setting --hadoop-major-version=2 when using the ec2 scripts, everything
worked fine.
Darin.
- Original Message -
From: Darin McBeath ddmcbe...@yahoo.com.INVALID
To: Mingyu Kim m...@palantir.com; Aaron Davidson
Sounds very similar to what I experienced, Corey. Something that seems to at
least help with my problems is to have more partitions. I am already fighting
between ending up with too many partitions in the end and having too few in
the beginning, by coalescing as late as possible and avoiding too few
This also triggers an interesting question: how can I do this locally in
code if I want to? For example: I have RDDs A and B, each with some partitions;
if I want to join A to B, I might just want to do a map-side join
(although B itself might be big, B's local partition is known to be small
Hi, I'm using MLlib to train a random forest. It's working fine to depth 15,
but if I use depth 20 I get a *java.lang.OutOfMemoryError: Requested array
size exceeds VM limit* on the driver, from the collectAsMap operation in
DecisionTree.scala, around line 642. It doesn't happen until a good hour
It doesn't mean 'out of memory'; it means 'you can't allocate a byte[]
over 2GB in the JVM'. Something is serializing a huge block somewhere.
there are a number of related JIRAs and discussions on JIRA and this
mailing list; have a browse of those first for back story.
On Mon, Feb 23, 2015 at
You are right. I was looking at the wrong logs. I ran it on my local
machine and saw that the println actually wrote the vertexIds. I was then
able to find the same in the executors' logs in the remote machine.
Thanks for the clarification.
On Mon, Feb 23, 2015 at 2:00 PM, Sean Owen
Hi Deepak,
Thanks for posting the link. It looks like it supports only Cloudera
distributions, as described on GitHub.
We are using an Apache Hadoop multi-node cluster, not a Cloudera
distribution. Please confirm whether I can use it on an Apache Hadoop
cluster.
Regards,
Sandeep.v
On Mon, Feb 23, 2015
In the snippet below,
graph.edges.groupBy[VertexId](f1).foreach {
  edgesBySrc => {
    f2(edgesBySrc).foreach {
      vertexId => {
        println(vertexId)
      }
    }
  }
}
f1 is a function that determines how to group the edges (in my case it
groups by source vertex)
f2 is another
It may involve accessing an element of an RDD on a remote machine and
copying it back to the driver. That, plus the small overhead of job
scheduling, could be a millisecond.
You're comparing to just reading an entry from memory, which is of
course faster.
I don't think you should think of an RDD as
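To make the comparison concrete, a hedged toy example (assumes a SparkContext named sc):
import org.apache.spark.SparkContext._
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2)).cache()
val viaRdd = pairs.lookup("a")              // runs a distributed job; may fetch from a remote executor
val viaMap = Map("a" -> 1, "b" -> 2)("a")   // plain in-process memory access, nanosecond-scale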
Hi,
I just wonder what the access time would be to take one element from a
cached RDD? If I have understood correctly, access to RDD elements is not
as fast as accessing e.g. a HashMap; it could take up to milliseconds
compared to nanoseconds in a HashMap, which is quite a significant difference
if
Here, println isn't happening on the driver. Are you sure you are
looking at the right machine's logs?
Yes this may be parallelized over many machines.
On Mon, Feb 23, 2015 at 6:37 PM, kvvt kvi...@vt.edu wrote:
In the snippet below,
graph.edges.groupBy[VertexId](f1).foreach {
edgesBySrc =>
I'd imagine that myRDD.take(10).foreach(println) is the most
straightforward thing but yeah you can probably change shell default
behavior too.
On Mon, Feb 23, 2015 at 7:15 PM, Mark Hamstra m...@clearstorydata.com wrote:
That will produce very different output than just the 10 items that Manas
There are different kinds of checkpointing going on. updateStateByKey
requires RDD checkpointing, which can be enabled only by calling
sparkContext.setCheckpointDir. But that does not enable Spark
Streaming driver checkpoints, which are necessary for recovering from driver
failures. That is
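A rough sketch of the distinction, under the assumption of a simple streaming app (directory names are hypothetical):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("checkpoint-sketch")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.sparkContext.setCheckpointDir("hdfs:///tmp/rdd-checkpoints") // per the note above, enough for updateStateByKey's RDD checkpointing
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")              // additionally enables driver-recovery checkpoints
                                                                 // (the context would then be built via StreamingContext.getOrCreate)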
Yes, if you're willing to add an explicit foreach(println), then that is
the simplest solution. Else changing maxPrintString should modify the
default output of the Scala/Spark REPL.
On Mon, Feb 23, 2015 at 11:25 AM, Sean Owen so...@cloudera.com wrote:
I'd imagine that
Hi experts,
I am using Spark 1.2 from CDH5.3.
When I issue commands like
myRDD.take(10), the result gets truncated after 4-5 records.
Is there a way to configure it to show more items?
..Manas
Dear Patrick,
Thanks a lot again for your help.
What happens if you submit from the master node itself on ec2 (in client
mode), does that work? What about in cluster mode?
If I SSH to the machine with Spark master, then everything works - shell, and
regular submit in both client and cluster
That will produce very different output than just the 10 items that Manas
wants.
This is essentially a Scala shell issue, so this should apply:
http://stackoverflow.com/questions/9516567/settings-maxprintstring-for-scala-2-9-repl
On Mon, Feb 23, 2015 at 10:25 AM, Akhil Das
Cool, we will start from there. Thanks Aaron and Josh!
Darin, it's likely because the DirectOutputCommitter is compiled with
Hadoop 1 classes and you're running it with Hadoop 2.
org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it
became an interface in Hadoop 2.
Mingyu
If you call reduceByKey(), internally Spark will introduce a shuffle
operation, no matter whether the data is already partitioned locally; Spark itself does
not know the data is already well partitioned.
So if you want to avoid the shuffle, you have to write the code explicitly to
avoid this, from my
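A hedged sketch of telling Spark about the partitioning up front so that reduceByKey does not need another shuffle (assumes a SparkContext named sc; the data is a toy example):
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
val part = new HashPartitioner(8)
val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val prePartitioned = pairRdd.partitionBy(part).cache()  // the shuffle happens once, here
val counts = prePartitioned.reduceByKey(part, _ + _)    // same partitioner, so no further shuffle stage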
You can use rdd.cartesian and then find the top-k by key to distribute the
work to executors. There is a trick to boost the performance: you need
to blockify user/product features and then use native matrix-matrix
multiplication. There is a relevant PR from Deb:
https://github.com/apache/spark/pull/3098 .
Did you try using a smaller number of partitions (user/product blocks)?
Did you use implicit feedback? In the current implementation, we only
do checkpointing with implicit feedback. We should adopt the
checkpoint strategy implemented in LDA:
Which Spark version did you use? Btw, there are three datasets from
MovieLens. The tutorial used the medium one (1 million). -Xiangrui
On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com wrote:
What do you mean?
You could build a REST API, but you may have issues if you want to return
arbitrary binary data. A more complex but robust alternative is to use
an RPC library like Akka, Thrift, etc.
TD
On Mon, Feb 23, 2015 at 12:45 AM, Nikhil Bafna nikhil.ba...@flipkart.com
wrote:
Tathagata - Yes,
Aaron, thanks for the class. Since I'm currently writing Java-based Spark
applications, I tried converting your class to Java (it seemed pretty
straightforward).
I set up the use of the class as follows:
SparkConf conf = new SparkConf()
  .set("spark.hadoop.mapred.output.committer.class",
Hi All,
I am running a simple PageRank program, but it is slow. I dug out that part
of the reason is that a shuffle happens when I call a union action even though both
RDDs share the same partitioner:
Below is my test code in the spark shell:
import org.apache.spark.HashPartitioner
Thanks for the suggestions.
I removed the persist call from the program. Doing so, I started it with:
spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn
/tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/
This takes all the defaults and only runs 2
Yes, we are going to expose the developer API. There was a long
discussion in the PR: https://github.com/apache/spark/pull/3637. So we
marked them package-private and are looking for feedback on how to improve
it. Please implement your classes under `spark.ml` for now and let us
know your feedback.
Thanks. I think my problem might actually be the other way around.
I'm compiling with Hadoop 2, but when I start up Spark using the ec2 scripts,
I don't specify
--hadoop-major-version, and the default is 1. I'm guessing that if I make that
a 2 it might work correctly. I'll try it and
In the book Learning Spark:
So here it means only that no shuffle happens across the network, but a
shuffle will still be done locally? Even if that is the case, why would union trigger a
shuffle? I think union just appends the RDDs together.
From: Shao, Saisai [mailto:saisai.s...@intel.com]
I think some RDD APIs like zipPartitions or others can do this as you wanted; you
might check the docs (a rough sketch follows below).
Thanks
Jerry
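A rough sketch of the zipPartitions idea: both sides are first co-partitioned with the same partitioner, then each pair of partitions is joined locally. All data here is a toy example and a SparkContext named sc is assumed.
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
val part = new HashPartitioner(4)
val rddA = sc.parallelize(Seq(1 -> "a1", 2 -> "a2")).partitionBy(part)
val rddB = sc.parallelize(Seq(1 -> "b1", 3 -> "b3")).partitionBy(part)
val joined = rddA.zipPartitions(rddB) { (iterA, iterB) =>
  val lookup = iterB.toMap               // B's local partition is assumed small enough to fit in memory
  iterA.flatMap { case (k, v) => lookup.get(k).map(w => (k, (v, w))) }
}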
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, February 23, 2015 1:35 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey will trigger
I have no context on this book; AFAIK union will not trigger a shuffle, as it
just puts the partitions together. The operator reduceByKey() will actually
trigger the shuffle.
Thanks
Jerry
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, February 23, 2015 12:26 PM
To: Shao, Saisai
Cc:
Hi Experts,
I am new to Spark, so I want to run it locally on my machine with
Ubuntu as the OS.
I downloaded the latest version of Spark.
I ran this command to start it: ./sbin/start-master.sh
but an error occurred:
*starting org.apache.spark.deploy.master.Master, logging to
It sounds like you downloaded the source distribution perhaps, but
have not built it. That's what the message is telling you. See
http://spark.apache.org/docs/latest/building-spark.html
Or maybe you intended to get a binary distribution.
On Mon, Feb 23, 2015 at 10:40 PM, King sami
I guess you downloaded the source code.
You can build it with the following command:
mvn -DskipTests clean package
Or just download a compiled version.
Shlomi
On 24 Feb 2015, at 00:40, King sami kgsam...@gmail.com wrote:
Hi Experts,
I am new to Spark, so I want to run it locally
In your log it says:
pickle.PicklingError: Can't pickle type 'thread.lock': it's not found as
thread.lock
As far as I know, you can't pickle Spark models. If you go to the
documentation for Pickle you can see that you can pickle only simple Python
structures and code (written in Python), at
I think you're getting tripped up by lazy evaluation and the way stage
boundaries work (admittedly it's pretty confusing in this case).
It is true that up until recently, if you unioned two RDDs with the same
partitioner, the result did not have the same partitioner. But that was
just fixed here:
1. The RMSE varies a little bit between the versions.
2. Partitioned the training, validation, and test sets like so:
- training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
- validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and
(x[3] % 10) < 8)
- test =
I've got the opposite problem with regards to partitioning. I've got over
6000 partitions for some of these RDDs which immediately blows the heap
somehow- I'm still not exactly sure how. If I coalesce them down to about
600-800 partitions, I get the problems where the executors are dying
without
Patrick,
I haven't changed the configs much. I just executed the ec2 script to create a
1-master, 2-slave cluster. Then I try to submit jobs from a remote machine,
leaving all the defaults configured by the Spark scripts. I've tried to
change configs as suggested in other mailing-list and Stack
Hello Todd,
Thank you for your suggestion! I have first tried increasing the Driver
memory to 2G and it worked without any problems, but I will also test with
the parameters and values you've shared.
Kind regards,
Emre Sevinç
http://www.bigindustries.be/
On Fri, Feb 20, 2015 at 3:25 PM, Todd
Hi Akhil,
I had installed Spark on the Hadoop cluster itself. All of my clusters are on
the same network.
Thanks,
Sandeep.v
On Mon, Feb 23, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
If you have both clusters on the same network, then I'd suggest
installing it on
You could do something like this.
def rddTransformationUsingBroadcast(rdd: RDD[...]): RDD[...] = {
  val broadcastToUse = getBroadcast()  // get the reference to a broadcast variable, new or existing
  rdd.map { .. }  // use broadcast variable
}
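A hedged sketch of what such a getBroadcast() helper could look like (all names here are hypothetical, not an existing Spark API):
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
object BroadcastHolder {
  @volatile private var current: Broadcast[Map[String, Int]] = _
  def refresh(sc: SparkContext, data: Map[String, Int]): Unit = {
    current = sc.broadcast(data)  // create a fresh broadcast variable, e.g. between batches
  }
  def getBroadcast(): Broadcast[Map[String, Int]] = current
}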
Could you find the executor logs on the executor where that task was
scheduled? That may provide more information on what caused the error.
Also take a look at where the block in question was stored, and where the
task was scheduled.
You will need to enable log4j INFO-level logs for this
Hi,
To simplify my problem -
I have 2 files from which I am reading words.
The output is like
file 1
aaa 4
bbb 6
ddd 3
file 2
ddd 2
bbb 6
ttt 5
If I do file1.join(file2)
I get (ddd,(3,2))
(bbb,(6,6))
If I want to sort the output by the number of occurrences of the word in file1,
how do I
I reached a similar conclusion about checkpointing. It requires your
entire computation to be serializable, even all of the 'local' bits.
Which makes sense. In my case I do not use checkpointing and it is
fine to restart the driver in the case of failure and not try to
recover its state.
What I
What happens if you submit from the master node itself on ec2 (in
client mode), does that work? What about in cluster mode?
It would be helpful if you could print the full command with which the
executor is failing. That might show that spark.driver.host is being
set strangely. IIRC we print the launch
Sean,
thanks for your message!
On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote:
What I haven't investigated is whether you can enable checkpointing
for the state in updateStateByKey separately from this mechanism,
which is exactly your question. What happens if you set a
Just make sure you meet the following:
1. Set spark.driver.host to your local IP (where you run your code; it
should be accessible from the cluster)
2. Make sure no firewall/router configurations are blocking/filtering the
connection between your laptop and the cluster. Best way to test
You will have to build a split infrastructure - a front end that takes the
queries from the UI and sends them to the backend, and the backend (running
the Spark Streaming app) will actually run the queries on tables created in
the contexts. The RPCs necessary between the frontend and backend will
Tathagata - Yes, I'm thinking along those lines.
The problem is how to send the query to the backend? Bundle an HTTP
server into the Spark Streaming job, so that it will accept the parameters?
--
Nikhil Bafna
On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com wrote:
You will have a
Hi guys,
does Spark Streaming support window operations on a sliding window whose
data is larger than the available memory?
we would like to
currently we are using Kafka as input, but we could change that if needed.
thanks
Avi
I was speaking about the 1.2 version of Spark.
Paolo
From: Paolo Platter mailto:paolo.plat...@agilelab.it
Sent: Monday, 23 February 2015 10:41
To: user@spark.apache.org
Hi guys,
Is the IN operator supported in Spark SQL over Hive Metastore ?
Thanks
Paolo
Hi guys,
Is the “IN” operator supported in Spark SQL over Hive Metastore ?
Thanks
Paolo
Hi,
I'm having problems using pipe() from a Spark program written in Java,
where I call a python script, running in a YARN cluster. The problem is
that the job fails when YARN kills the container because the python script
is going beyond the memory limits. I get something like this in the log:
Hi Team,
I was trying to save a DecisionTree model from PySpark using joblib.
It is giving me the following error: http://pastebin.com/82CFhPNn . Any clue
how to resolve this or otherwise save a model?
Best regards
Jagan
(Move to user list.)
Hi Kannan,
You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is
this line of code
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68,
which overrides spark.default.parallelism. Also,
Hello,
I am trying to deserialize some data encoded using proto buff from within
Spark and am getting class-not-found exceptions. I have narrowed the
program down to something very simple that shows the problem exactly (see
'The Program' below) and hopefully someone can tell me the easy fix :)
Try to set lambda to 0.1. -Xiangrui
On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar ksanka...@gmail.com wrote:
The RMSE varies a little bit between the versions.
Partitioned the training, validation, and test sets like so:
training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
validation =
FYI, in 1.3 we support save/load tree models in Scala and Java. We will add
save/load support to Python soon. -Xiangrui
On Mon, Feb 23, 2015 at 2:57 PM, Sebastián Ramírez
sebastian.rami...@senseta.com wrote:
In your log it says:
pickle.PicklingError: Can't pickle type 'thread.lock': it's not
I *think* this may have been related to the default memory overhead setting
being too low. I raised the value to 1G and tried my job again, but I had
to leave the office before it finished. It did get further, but I'm not
exactly sure if that's just because I raised the memory. I'll see tomorrow-
I keep getting my mail rejected, so I am creating a new thread. SMTP error,
DOT: 552 spam score (5.7) exceeded threshold
(FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_REPLY,HTML_FONT_FACE_BAD,HTML_MESSAGE,RCVD_IN_BL_SPAMCOP_NET,SPF_PASS
Hi Silvio and Ted
I know there is a configuration parameter to
Thanks both of you guys on this!
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 12:58
To: Tathagata Das
CC: user; bit1129
Subject: Re: About FlumeUtils.createStream
I see, thanks for the clarification TD.
On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote:
Akhil, that is
bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.Reports$SensorReports
Is the Reports$SensorReports class in rick-processors-assembly-1.0.jar?
Thanks
On Mon, Feb 23, 2015 at 8:43 PM, necro351 . necro...@gmail.com wrote:
Hello,
I am trying to deserialize some data encoded
I could trace where the problem is. If I run without any threads, it works
fine. When I allocate threads, I run into a "not serializable" problem. But I
need to have threads in my code.
Any help please!!!
This is my code:
object SparkKart
{
def parseVector(line: String): Vector[Double] = {
The behavior is exactly what I expected. Thanks Akhil and Tathagata!
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 13:32
To: bit1129
CC: Tathagata Das; user
Subject: Re: Re: About FlumeUtils.createStream
That depends on how many machines you have in your cluster. Say you have 6
workers and
You mean SPARK_WORKER_CORES in /conf/spark-env.sh?
On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com
wrote:
In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there
are slots for task threads. The slot count is configured by the num_cores
setting.
Hi Akhil, Tathagata,
This leads me to another question. For the Spark Streaming and Kafka
integration, if there is more than one Receiver in the cluster, such as
val streams = (1 to 6).map(_ => KafkaUtils.createStream(ssc, zkQuorum,
group, topicMap).map(_._2)),
then these Receivers will
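For reference, a hedged sketch of combining those receivers into one stream (ssc, zkQuorum, group and topicMap as in the snippet above):
import org.apache.spark.streaming.kafka.KafkaUtils
val streams = (1 to 6).map(_ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2))
val unified = ssc.union(streams)  // a single DStream fed by all 6 receivers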
Hi all
I wanted to share some details about something I've been working on with
the folks on the ADAM project: performance instrumentation for Spark jobs.
We've added a module to the bdg-utils project (
https://github.com/bigdatagenomics/bdg-utils) to enable Spark users to
instrument RDD
[hadoop@hadoop bin]$ sh submit.log.streaming.kafka.complicated.sh
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Start to run MyKafkaWordCount
Exception in thread main java.net.ConnectException: Call From
hadoop.master/192.168.26.137 to hadoop.master:9000
Is your laptop behind a NAT ?
I got bitten by a similar issue and (I think) it was because I was behind a
NAT that did not forward the public ip back to my private ip unless the
connection originated from my private ip
cheers
On Tue, Feb 24, 2015 at 5:20 AM, Oleg Shirokikh o...@solver.com