Hi Sourabh, could you try it with the stable 2.4 version of IPython?
On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha sourabh.g...@hotmail.com wrote:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg
I get the above error when I try to run pyspark with the ipython
Hi,
I am running a Spark application on YARN in cluster mode.
One of my executors appears to hang for a long time and finally gets
killed by the driver.
Unlike the other executors, it has not received a StopExecutor message
from the driver.
Here are the logs at the end of this
Dear all,
I want to implement some sequential algorithm on RDD.
For example:
val conf = new SparkConf()
conf.setMaster("local[2]").
  setAppName("SequentialSuite")
val sc = new SparkContext(conf)
val rdd = sc.
  parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
  sortBy(x => x, true)
Thanks,
my issue was exactly that the function to extract the class from the file used
the same object, by only changing it. Creating a new object for each item
solved the issue.
Thank you very much for your reply.
Best regards.
Il giorno 26/feb/2015, alle ore 22:25, Imran Rashid
You don’t need to know rdd dependencies to maximize dependencies. Internally
the scheduler will construct the DAG and trigger the execution if there are no
shuffle dependencies between the RDDs.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm
What confused me is the statement “The final result is that rdd1 is
calculated twice.” Is that the expected behavior?
Thanks.
Zhan Zhang
On Feb 26, 2015, at 3:03 PM, Sean Owen
so...@cloudera.com wrote:
To distill this a bit further, I don't think you actually want
I think I'm getting more confused the longer this thread goes. So
rdd1.dependencies provides immediate parents to rdd1. For now i'm going to
walk my internal DAG from the root down and see where running the caching
of siblings concurrently gets me.
I still like your point, Sean, about trying to
The short answer:
count(), as the sum can be partially aggregated on the mappers.
The long answer:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
—
FG
On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc
Those do quite different things. One counts the data; the other copies
all of the data to the driver.
The fastest way to materialize an RDD that I know of is
foreachPartition(i = None) (or equivalent no-op VoidFunction in
Java)
On Thu, Feb 26, 2015 at 1:28 PM, Emre Sevinc emre.sev...@gmail.com
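A minimal Scala sketch of the three alternatives discussed in this thread; rdd stands for any already-computed RDD:
val n = rdd.count()                 // only partial counts travel; a single Long reaches the driver
val all = rdd.collect()             // copies every element to the driver; risky for large RDDs
rdd.foreachPartition(_ => ())       // no-op action: forces evaluation without moving data to the driver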
Note I’m assuming you were going for the size of your RDD, meaning in the
‘collect’ alternative, you would go for a size() right after the collect().
If you were simply trying to materialize your RDD, Sean’s answer is more
complete.
—
FG
On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc
Hi!
I downloaded and extracted Spark to local folder under windows 7 and have
successfully played with it in pyspark interactive shell.
BUT
When I try to use spark-submit (for example: job-submit pi.py ) I get:
C:\spark-1.2.1-bin-hadoop2.4\bin>spark-submit.cmd pi.py
Using Spark's default log4j
You are using a Hive version which is not supported by Spark SQL. Spark
SQL 1.1.x and prior versions only support Hive 0.12.0. Spark SQL 1.2.0
supports Hive 0.12.0 or Hive 0.13.1.
On 2/27/15 12:12 AM, sandeep vura wrote:
Hi Cheng,
Thanks, the above issue has been resolved. I have configured
If WRFVariableText is a Text, then it implements Writable, not Java
serialization. You can implement Serializable in your class, or
consider reusing SerializableWritable in Spark (note it's a developer
API).
On Thu, Feb 26, 2015 at 4:03 PM, patcharee patcharee.thong...@uni.no wrote:
Hi,
I am
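A minimal Scala sketch of the two options mentioned above; the class name comes from the thread, and records is a hypothetical RDD of Text values:
import org.apache.hadoop.io.Text
import org.apache.spark.SerializableWritable
// option 1: make the record class Java-serializable in addition to Writable
class WRFVariableText extends Text with Serializable
// option 2: wrap existing Writables in Spark's SerializableWritable (a developer API)
val wrapped = records.map(t => new SerializableWritable(t))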
Hello,
I have an issue with the cartesian method. When I use it with the Java types
everything is ok, but when I use it with RDD made of objects defined by me
it has very strange behaviors which depend on whether the RDD is cached or
not (you can see here
Anyone can share any thoughts related to my questions?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Help me understand the partition, parallelism in Spark
Date: Wed, 25 Feb 2015 21:58:55 -0500
Hi, Sparkers:
I come from the Hadoop MapReduce world, and am trying to understand
Oh, thanks for the clarification. I will try to downgrade Hive.
On Thu, Feb 26, 2015 at 9:44 PM, Cheng Lian lian.cs@gmail.com wrote:
You are using a Hive version which is not supported by Spark SQL. Spark SQL
1.1.x and prior versions only support Hive 0.12.0. Spark SQL 1.2.0 supports
Hive
Hi,
I am using a custom InputFormat and RecordReader. This custom RecordReader
has the declaration:
public class NetCDFRecordReader extends RecordReader<WRFIndex,
WRFVariableText>
The WRFVariableText extends Text:
public class WRFVariableText extends org.apache.hadoop.io.Text
The WRFVariableText
I believe that's right, and is what I was getting at. Yes, the implicit
formulation ends up implicitly including every possible interaction in
its loss function, even unobserved ones. That could be the difference.
This is mostly an academic question though. In practice, you have
click-like data
Yea we discussed this on the list a short while ago. The extra
overhead of count() is pretty minimal. Still you could wrap this up as
a utility method. There was even a proposal to add some 'materialize'
method to RDD.
PS you can make your Java a little less verbose by omitting throws
Exception
Hi,
I just wrote an application that intends to submit its actions (jobs) via
independent threads, keeping in view the point “Second, within each
Spark application, multiple ‘jobs’ (Spark actions) may be running
concurrently if they were submitted by different threads”, mentioned in:
On Thu, Feb 26, 2015 at 4:20 PM, Sean Owen so...@cloudera.com wrote:
Yea we discussed this on the list a short while ago. The extra
overhead of count() is pretty minimal. Still you could wrap this up as
a utility method. There was even a proposal to add some 'materialize'
method to RDD.
I
oh my god, I think I understood...
In my case, there are three kinds of user-item pairs:
Display and click pair (positive pair)
Display but no-click pair (negative pair)
No-display pair (unobserved pair)
Explicit ALS only considers the first and the second kinds
But implicit ALS considers all the
Hi Rob,
I fear your questions will be hard to answer without additional information
about what kind of simulations you plan to do. int[r][c] basically means
you have a matrix of integers? You could for example map this to a
row-oriented RDD of integer-arrays or to a column oriented RDD of integer
Hello Sean,
Thank you for your advice. Based on your suggestion, I've modified the code
into the following (and once again admired the easy (!) verbosity of Java
compared to 'complex and hard to understand' brevity (!) of Scala):
javaDStream.foreachRDD(
new Function<JavaRDD<String>,
By the way, the limitation of case classes to 22 parameters was removed in
Scala 2.11 (https://issues.scala-lang.org/browse/SI-7296,
https://issues.scala-lang.org/browse/SI-7098); there's some technical
rough edge (https://github.com/scala/scala/pull/2305) past 22 that you most
likely will never run
Thank you very much for your opinion:)
In our case, maybe it's dangerous to treat un-observed items as negative
interactions (although we could give them small confidence, I think they are
still not credible...)
I will do more experiments and give you feedback:)
Thank you;)
On
I’m using ALS with spark 1.0.0, the code should be:
https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
I think the following two methods should produce the same (or nearly the same) result:
MatrixFactorizationModel model =
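A minimal Scala sketch (MLlib 1.x) of the two calls being compared; ratings, the rank (10), iteration count (20), lambda (0.01) and alpha (1.0) below are placeholders:
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
val ratings: RDD[Rating] = ???                                     // (user, product, rating) triples
val explicitModel = ALS.train(ratings, 10, 20, 0.01)               // explicit feedback
val implicitModel = ALS.trainImplicit(ratings, 10, 20, 0.01, 1.0)  // implicit feedback with confidence weight alpha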
Seems that you are running the Hive metastore over MySQL, but don’t have
the MySQL JDBC driver on the classpath:
Caused by:
org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException:
The specified datastore driver (“com.mysql.jdbc.Driver”) was not
found in the CLASSPATH.
All,
We are getting the error below when using the Drill JDBC driver with Spark;
please let us know what the issue could be.
java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian
cannot access its superclass io.netty.buffer.WrappedByteBuf
at
okay, I have brought this to the user@list
I don’t think the negative pairs should be omitted...
if the scores of all of the pairs are 1.0, the result will be worse... I have tried...
Best Regards,
Sendong Li
On Feb 26, 2015, at 10:07 PM, Sean Owen so...@cloudera.com wrote:
Yes, I mean, do not generate a
Hi!
I downloaded and unpacked the Spark binaries and could successfully run the
pyspark shell and write and execute some code there.
BUT
I failed when submitting stand-alone Python scripts or jar files via
spark-submit:
spark-submit pi.py
I always get an exception stack trace with NullPointerException in
Francois,
Thank you for quickly verifying.
Kind regards,
Emre Sevinç
On Thu, Feb 26, 2015 at 2:32 PM, francois.garil...@typesafe.com wrote:
The short answer:
count(), as the sum can be partially aggregated on the mappers.
The long answer:
The Apache MRQL team is pleased to announce the release of
Apache MRQL 0.9.4-incubating. This is our second Apache release.
Apache MRQL is a query processing and optimization system for
large-scale, distributed data analysis, built on top of
Apache Hadoop, Hama, Spark, and Flink.
The release
Correct me if I'm wrong, but he can actually run this code without
broadcasting the users map; however, the code will be less efficient.
On Thu, Feb 26, 2015 at 12:31 PM, Sean Owen so...@cloudera.com
wrote:
Yes, but there is no concept of executors 'deleting' an RDD. And you
would want
Hello,
I have a piece of code to force the materialization of RDDs in my Spark
Streaming program, and I'm trying to understand which method is faster and
has less memory consumption:
javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
@Override
public Void call(JavaRDD<String>
Yes that's correct; it works but broadcasting would be more efficient.
On Thu, Feb 26, 2015 at 1:20 PM, Paweł Szulc paul.sz...@gmail.com wrote:
Correct me if I'm wrong, but he can actually run this code without
broadcasting the users map; however, the code will be less efficient.
On Thu, Feb 26
+user
On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote:
I think I may have it backwards, and that you are correct to keep the 0
elements in train() in order to try to reproduce the same result.
The second formulation is called 'weighted regularization' and is used for
Lisen, did you use all m-by-n pairs during training? Implicit model
penalizes unobserved ratings, while explicit model doesn't. -Xiangrui
On Feb 26, 2015 6:26 AM, Sean Owen so...@cloudera.com wrote:
+user
On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote:
I think I may
Can someone with experience briefly share or summarize the differences
between Ignite and Spark? Are they complementary? Totally unrelated?
Overlapping? It seems Ignite has reached version 1.0; I had never heard
of it until a few days ago, and given what is advertised, it sounds pretty
One more question: while processing the exact same batch I noticed that
giving more CPUs to the worker does not decrease the duration of the batch.
I tried this with 4 and 8 CPUs. Though I noticed that with only 1 CPU
the duration increased; apart from that, the values were pretty similar,
FYI. We're currently addressing this at the Hadoop level in
https://issues.apache.org/jira/browse/HADOOP-9565
Thomas Demoor
On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:
Just to close the loop in case anyone runs into the same problem I had.
By setting
Hello everyone,
We are trying to decode a message inside a Spark job that we receive from
Kafka. The message is encoded using Proto Buff. The problem is when
decoding we get class-not-found exceptions. We have tried remedies we found
online on Stack Exchange and in mailing list archives, but nothing
You can see this information in the yarn web UI using the configuration I
provided in my former mail (click on the application id, then on logs; you will
then be automatically redirected to the yarn history server UI).
On 24/02/2015 19:49, Colin Kincaid Williams wrote:
So back to my original
By setting spark.eventLog.enabled to true it is possible to see the
application UI after the application has finished its execution, however
the Streaming tab is no longer visible.
For measuring the duration of batches in the code I am doing something like
this:
wordCharValues.foreachRDD(rdd => {
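A minimal sketch of what such a timing block could look like; the no-op foreachPartition is only one way to force the batch to be computed:
wordCharValues.foreachRDD { rdd =>
  val start = System.currentTimeMillis()
  rdd.foreachPartition(_ => ())     // force evaluation of the batch
  println(s"batch took ${System.currentTimeMillis() - start} ms")
}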
On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com wrote:
So, should the userClassPathFirst flag work and there is a bug?
Sorry for jumping in the middle of conversation (and probably missing
some of it), but note that this option applies only to executors. If
you're trying to
Ignite is the renaming of GridGain, if that helps. It's like Oracle
Coherence, if that helps. These do share some similarities -- fault
tolerant, in-memory, distributed processing. The pieces they're built
on differ, the architecture differs, the APIs differ. So fairly
different in particulars. I
Hi All,
I have Spark Streaming setup to write data to a replicated MongoDB database
and would like to understand if there would be any issues using the Reactive
Mongo library to write directly to the mongoDB? My stack is Apache Spark
sitting on top of Cassandra for the datastore, so my thinking
Hi,
I have the following use case.
(1) I have an RDD of edges of a graph (say R).
(2) do a groupBy on R (by say source vertex) and call a function F on each
group.
(3) collect the results from Fs and do some computation
(4) repeat the above steps until some criteria is met
In (2), the groups
Can somebody explain the difference between batch interval, window interval,
and window sliding interval with an example?
Is there any real-time use case for using these parameters?
Thanks
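A minimal Scala sketch illustrating the three intervals; the durations and the socket source are placeholders:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("WindowExample")
val ssc = new StreamingContext(conf, Seconds(2))        // batch interval: a new RDD every 2 seconds
val lines = ssc.socketTextStream("localhost", 9999)
val windowed = lines.window(Seconds(30), Seconds(10))   // window length 30s, sliding interval 10s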
Hi Cheng,
Thanks, the above issue has been resolved. I have configured a remote metastore,
not a local metastore, in Hive.
While creating a table in Spark SQL, another error shows up on the terminal.
The error is given below:
sqlContext.sql("LOAD DATA LOCAL INPATH
'/home/spark12/sandeep_data/sales_pg.csv'
On Wed, Feb 25, 2015 at 8:09 PM, Mukesh Jha me.mukesh@gmail.com wrote:
My application runs fine for ~3/4 hours and then hits this issue.
On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Hi Experts,
My Spark Job is failing with below error.
From the logs I
I am a beginner to the world of Machine Learning and the usage of Apache
Spark.
I have followed the tutorial at
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html#augmenting-matrix-factors
Sean,
Would you kindly suggest on which forum, mailing list, or issue tracker to ask
questions about the AAS book?
Or is no such provision made?
regards,
Deepak
On Thu, 2/26/15, Sean Owen so...@cloudera.com wrote:
Subject: Re: value foreach is not a member
My guess would be that you are packaging too many things in your job, which
is causing problems with the classpath. When your jar goes in first, you
get the correct version of protobuf, but some other version of something
else. When your jar goes in later, other things work, but protobuf
breaks.
Hi Tristan,
at first I thought you were just hitting another instance of
https://issues.apache.org/jira/browse/SPARK-1391, but I actually think it's
entirely related to kryo. Would it be possible for you to try serializing
your object using kryo, without involving spark at all? If you are
You can flatMap:
rdd.flatMap { in =>
  if (condition(in)) {
    Some(transformation(in))
  } else {
    None
  }
}
On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing crystalxin...@gmail.com wrote:
Hi,
I have a text file input and I want to parse line by line and map each line
to another format. But
Ch 6 listing from Advanced Analytics with Spark generates an error. The listing
is
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline:
StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val
I see.
The reason we can use flatMap to map to nothing, but cannot use map to map to
null, is that
flatMap supports mapping to zero or more elements, while map only supports a 1-1 mapping?
It seems flatMap is more equivalent to Hadoop's map.
Thanks,
Zheng zhen
On Thu, Feb 26, 2015 at 10:44 AM, Sean Owen
Hi,
I have a text file input and I want to parse line by line and map each line
to another format. But at the same time, I want to filter out some lines I
do not need.
I wonder if there is a way to filter out those lines in the map function.
Do I have to do two steps filter and map? In that
If you have one receiver, and you are doing only map-like operations, then
the process will primarily happen on one machine. To use all the machines,
either receive in parallel with multiple receivers, or spread out the
computation by explicitly repartitioning the received streams
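A minimal Scala sketch of the two options described above; ssc and the socket source are placeholders:
// receive in parallel with several receivers, unioned into one stream
val streams = (1 to 4).map(_ => ssc.socketTextStream("localhost", 9999))
val unioned = ssc.union(streams)
// or spread a single receiver's batches across the cluster
val repartitioned = unioned.repartition(8)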
Hi ,
I am trying to run a simple hadoop job (that uses
CassandraHadoopInputOutputWriter) on spark (v1.2 , Hadoop v 1.x) but getting
NullPointerException in TaskSetManager
WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost
task 14.2 in stage 0.0 (TID 29,
(Books on Spark are not produced by the Spark project, and this is not
the right place to ask about them. This question was already answered
offline, too.)
On Thu, Feb 26, 2015 at 6:38 PM, Deepak Vohra
dvohr...@yahoo.com.invalid wrote:
Ch 6 listing from Advanced Analytics with Spark generates
rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be
pipelined into a single stage, so there generally isn't any need to
complect the map and filter into a single function.
Additionally, there is RDD#collect[U](f: PartialFunction[T, U])(implicit
arg0: ClassTag[U]): RDD[U],
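A minimal sketch of that collect-with-PartialFunction variant; lines and parse are hypothetical:
// filters and maps in one pass: only lines the partial function is defined for are kept
val parsed = lines.collect { case line if line.nonEmpty => parse(line) }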
I am seeing a problem with a Spark job in standalone mode. Spark master's
web interface shows a task RUNNING on a particular executor, but the logs
of the executor do not show the task being ever assigned to it, that is,
such a line is missing from the log:
15/02/25 16:53:36 INFO
Hi Tushar,
The most scalable option is probably for you to consider doing some
approximation. E.g., sample the data first to come up with the bucket
boundaries. Then you can assign data points to buckets without needing to
do a full groupByKey. You could even have more passes which correct any
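A minimal Scala sketch of the sampling idea; values is a hypothetical RDD[Double] and 10 buckets are assumed:
val sample = values.sample(withReplacement = false, fraction = 0.01).collect().sorted
val boundaries = (1 until 10).map(i => sample(i * sample.length / 10))
val bucketed = values.map(v => (boundaries.count(_ <= v), v))   // (bucket index, value)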
Can someone confirm if they can run UDFs in GROUP BY in Spark 1.2?
I have two builds running -- one from a custom build from early December
(commit 4259ca8dd12) which works fine, and Spark 1.2-RC2.
On the latter I get:
jdbc:hive2://XXX.208:10001 select
any chance your input RDD is being read from hdfs, and you are running into
this issue (in the docs on SparkContext#hadoopFile):
* '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
object for each
* record, directly caching the returned RDD or directly passing it to an
Let's say I'm given 2 RDDs and told to store them in a sequence file and
they have the following dependency:
val rdd1 = sparkContext.sequenceFile().cache()
val rdd2 = rdd1.map()
How would I tell programmatically without being the one who built rdd1 and
rdd2 whether or not rdd2
val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER)
// or whatever persistence makes more sense for you ...
while (true) {
  val res = grouped.flatMap(F)
  res.collect.foreach(func)
  if (criteria)
    break
}
On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan
I see the rdd.dependencies() function, does that include ALL the
dependencies of an RDD? Is it safe to assume I can say
rdd2.dependencies.contains(rdd1)?
On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm given 2 RDDs and told to store them in a sequence file
Okay I confirmed my suspicions of a hang. I made a request that stopped
progressing, though the already-scheduled tasks had finished. I made a
separate request that was small enough not to hang, and it kicked the hung
job enough to finish. I think what's happening is that the scheduler or the
Hi Yong,
mostly correct except for:
- Since we are doing reduceByKey, shuffling will happen. Data will be
shuffled into 1000 partitions, as we have 1000 unique keys.
no, you will not get 1000 partitions. Spark has to decide how many
partitions to use before it even knows how many
Hello Experts,
In one of my projects we are having parquet files and we are using spark SQL
to get our analytics. I am encountering situation where simple SQL is not
getting me what I need or the complex SQL is not supported by Spark Sql. In
scenarios like this I am able to get things done using
No, it does not give you transitive dependencies. You'd have to walk the
tree of dependencies yourself, but that should just be a few lines.
On Thu, Feb 26, 2015 at 3:32 PM, Corey Nolet cjno...@gmail.com wrote:
I see the rdd.dependencies() function, does that include ALL the
dependencies of
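A minimal sketch of walking the dependency tree yourself, as suggested above; rdd1 and rdd2 are from the thread:
import org.apache.spark.rdd.RDD
def allDependencies(rdd: RDD[_]): Seq[RDD[_]] =
  rdd.dependencies.map(_.rdd).flatMap(parent => parent +: allDependencies(parent))
// e.g. a transitive check along the lines of the question
val rdd2DependsOnRdd1 = allDependencies(rdd2).contains(rdd1)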
Hi,
I have a 3-node Spark cluster:
node1, node2 and node3.
I am running the below command on node 1 to deploy the driver:
/usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class
com.fst.firststep.aggregator.FirstStepMessageProcessor --master
spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077
The data is small. The job is composed of many small stages.
* I found that with fewer than 222 the problem exhibits. What will be
gained by going higher?
* Pushing up the parallelism only pushes up the boundary at which the
system appears to hang. I'm worried about some sort of message loss or
So to summarize (I had a similar question):
Spark's log4j is by default configured to log to the console? Those
messages end up in the stderr files and that approach does not support
rolling?
If I configure log4j to log to files, how can I keep the folder structure?
Should I use relative paths
I think that's up to you. You can make it log wherever you want, and
have some control over how log4j names the rolled log files by
configuring its file-based rolling appender.
On Thu, Feb 26, 2015 at 10:05 AM, Jeffrey Jedele
jeffrey.jed...@gmail.com wrote:
So to summarize (I had a similar
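A minimal sketch of a file-based rolling appender in log4j.properties; the path, sizes and pattern are placeholders:
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/executor.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n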
Thanks for the link. Unfortunately, I turned on RDD compression and nothing
changed. I tried moving from netty to nio and no change :(
On Thu, Feb 26, 2015 at 2:01 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Not many that i know of, but i bumped into this one
No. That code is just Scala code executing on the driver. usersMap is
a local object. This bit has nothing to do with Spark.
Yes, you would have to broadcast it to use it efficiently in functions
(not on the driver).
On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
So,
Hi,
I am not able to read from HDFS (Intel distribution Hadoop, Hadoop version
1.0.3) from spark-shell (Spark version 1.2.1). I built Spark using the
command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and
read an HDFS file using sc.textFile(), and the exception is
WARN
Yeah did that already (65k). We also disabled swapping and reduced the amount
of memory allocated to Spark (available - 4). This seems to have resolved the
situation.
Thanks!
On 26.02.2015, at 05:43, Raghavendra Pandey raghavendra.pan...@gmail.com
wrote:
Can you try increasing the ulimit
Hi,
I am trying to apply binning to a large CSV dataset. Here are the steps I
am taking:
1. Emit each value of CSV as (ColIndex,(RowIndex,value))
2. Then I groupByKey (here ColumnIndex) and get all values of a particular
index to one node, as I have to work on the collection of all values
3. I
Could you check the Spark web UI for the number of tasks issued when the
query is executed? I dug out mapred.map.tasks because I saw 2 tasks
were issued.
On 2/26/15 3:01 AM, Kannan Rajah wrote:
Cheng, We tried this setting and it still did not help. This was on
Spark 1.2.0.
--
Kannan
Is there any potential problem from 1.1.1 to 1.2.1 with shuffle
dependencies that produce no data?
On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen v...@paxata.com wrote:
The data is small. The job is composed of many small stages.
* I found that with fewer than 222 the problem exhibits.
I have a question,
If I execute this code,
val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(
  v => (v(0), v(1)))
val contacts = sc.textFile("/tmp/contacts.log").map(y =>
  y.split(",")).map(v => (v(0), v(1)))
val usersMap = contacts.collectAsMap()
contacts.map(v => (v._1, (usersMap(v._1),
Not many that i know of, but i bumped into this one
https://issues.apache.org/jira/browse/SPARK-4516
Thanks
Best Regards
On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen v...@paxata.com wrote:
Is there any potential problem from 1.1.1 to 1.2.1 with shuffle
dependencies that produce no
No, it exists only on the driver, not the executors. Executors don't
retain partitions unless they are supposed to be persisted.
Generally, broadcasting a small Map to accomplish a join 'manually' is
more efficient than a join, but you are right that this is mostly
because joins usually involve
So, in my example, when I execute:
val usersMap = contacts.collectAsMap() -- the Map goes to the driver and
just lives there in the beginning.
contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect
When I execute usersMap(v._1),
does the driver have to send to executorX the value which it needs? I
Isn't contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
executed in the executors? Why is it executed in the driver?
contacts is not a local object, right?
2015-02-26 11:27 GMT+01:00 Sean Owen so...@cloudera.com:
No. That code is just Scala code executing on the driver. usersMap
So basically you have lots of small ML tasks you want to run concurrently?
With "I've used repartition and cache to store the sub-datasets on only one
machine", do you mean that you reduced each RDD to have only one partition?
Maybe you want to give the fair scheduler a try to get more of your tasks
Hi Patrick
Thanks a ton for your in-depth answer. The compilation error is now
resolved.
Thanks a lot again !!
On Thu, Feb 26, 2015 at 2:40 PM, Patrick Varilly
patrick.vari...@dataminded.be wrote:
Hi, Akhil,
In your definition of sdp_d
Yes, in that code, usersMap has been serialized to every executor.
I thought you were referring to accessing the copy in the driver.
On Thu, Feb 26, 2015 at 10:47 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
Isn't contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
executed in the
Hi Spico,
Yes, I think an executor core in Spark is basically a thread in a worker
pool. It's recommended to have one executor core per physical core on your
machine for best performance, but I think in theory you can create as many
threads as your OS allows.
For deployment:
There seems to be
One last time, to be sure I got it right, the execution sequence here
goes like this?:
val usersMap = contacts.collectAsMap()
# The contacts RDD is collected by the executors and sent to the
driver; the executors delete the rdd
contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
# The userMap
Yes, but there is no concept of executors 'deleting' an RDD. And you
would want to broadcast the usersMap if you're using it this way.
On Thu, Feb 26, 2015 at 11:26 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
One last time, to be sure I got it right, the execution sequence here
goes like
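A minimal sketch of the broadcast variant being suggested; the variable names follow the thread:
val usersMap = contacts.collectAsMap()
val usersMapBc = sc.broadcast(usersMap)
val joined = contacts.map(v => (v._1, (usersMapBc.value(v._1), v._2))).collect()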
Thanks!
Date: Thu, 26 Feb 2015 12:51:21 +0530
Subject: Re: Spark cluster set up on EC2 customization
From: ak...@sigmoidanalytics.com
To: ssti...@live.com
CC: user@spark.apache.org
You can easily add a function (say setup_pig) inside the function setup_cluster
in this script.
Thanks
Best Regards
Sure, Thanks Tathagata!
bit1...@163.com
From: Tathagata Das
Date: 2015-02-26 14:47
To: bit1...@163.com
CC: Akhil Das; user
Subject: Re: Re: Many Receiver vs. Many threads per Receiver
Spark Streaming has a new Kafka direct stream, to be released as an experimental
feature with 1.3. That uses a