Re: Get importerror when i run pyspark with ipython=1

2015-02-26 Thread Jey Kottalam
Hi Sourabh, could you try it with the stable 2.4 version of IPython? On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha sourabh.g...@hotmail.com wrote: http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg I get the above error when I try to run pyspark with the ipython

One of the executor not getting StopExecutor message

2015-02-26 Thread twinkle sachdeva
Hi, I am running a Spark application on YARN in cluster mode. One of my executors appears to be in a hung state for a long time and finally gets killed by the driver. Unlike the other executors, it has not received the StopExecutor message from the driver. Here are the logs at the end of this

Global sequential access of elements in RDD

2015-02-26 Thread Wush Wu
Dear all, I want to implement some sequential algorithm on RDD. For example: val conf = new SparkConf() conf.setMaster("local[2]").setAppName("SequentialSuite") val sc = new SparkContext(conf) val rdd = sc.parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).sortBy(x => x, true)

Re: Cartesian issue with user defined objects

2015-02-26 Thread Marco Gaido
Thanks, my issue was exactly that: the function extracting the class from the file reused the same object, only mutating it. Creating a new object for each item solved the issue. Thank you very much for your reply. Best regards. On 26 Feb 2015, at 22:25, Imran Rashid

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
You don’t need to know RDD dependencies to maximize concurrency. Internally, the scheduler will construct the DAG and trigger execution if there are no shuffle dependencies between the RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
What confused me is the statement “The final result is that rdd1 is calculated twice.” Is that the expected behavior? Thanks. Zhan Zhang On Feb 26, 2015, at 3:03 PM, Sean Owen so...@cloudera.com wrote: To distill this a bit further, I don't think you actually want

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I think I'm getting more confused the longer this thread goes. So rdd1.dependencies provides the immediate parents of rdd1. For now I'm going to walk my internal DAG from the root down and see where running the caching of siblings concurrently gets me. I still like your point, Sean, about trying to

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread francois . garillot
The short answer: count(), as the sum can be partially aggregated on the mappers. The long answer: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html — FG On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread Sean Owen
Those do quite different things. One counts the data; the other copies all of the data to the driver. The fastest way to materialize an RDD that I know of is foreachPartition(i => None) (or equivalent no-op VoidFunction in Java) On Thu, Feb 26, 2015 at 1:28 PM, Emre Sevinc emre.sev...@gmail.com
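
A minimal sketch contrasting the approaches discussed in this thread, assuming a plain Scala RDD (the dataset and object name are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  object MaterializeSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("materialize-sketch").setMaster("local[2]"))
      val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()

      // count() materializes the RDD; only a single Long is returned to the driver.
      val n = rdd.count()

      // collect() also materializes it, but copies every element to the driver -- avoid on large RDDs.
      // val everything = rdd.collect()

      // A no-op foreachPartition materializes the cached partitions without moving any data at all.
      rdd.foreachPartition(_ => ())

      println(s"count = $n")
      sc.stop()
    }
  }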

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread francois . garillot
Note I’m assuming you were going for the size of your RDD, meaning in the ‘collect’ alternative, you would go for a size() right after the collect(). If you were simply trying to materialize your RDD, Sean’s answer is more complete. — FG On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc

can not submit job to spark in windows

2015-02-26 Thread sergunok
Hi! I downloaded and extracted Spark to a local folder under Windows 7 and have successfully played with it in the pyspark interactive shell. BUT when I try to use spark-submit (for example: spark-submit pi.py) I get: C:\spark-1.2.1-bin-hadoop2.4\bin>spark-submit.cmd pi.py Using Spark's default log4j

Re: Creating hive table on spark ((ERROR))

2015-02-26 Thread Cheng Lian
You are using a Hive version which is not supported by Spark SQL. Spark SQL 1.1.x and prior versions only support Hive 0.12.0. Spark SQL 1.2.0 supports Hive 0.12.0 or Hive 0.13.1. On 2/27/15 12:12 AM, sandeep vura wrote: Hi Cheng, Thanks, the above issue has been resolved. I have configured

Re: custom inputformat serializable problem

2015-02-26 Thread Sean Owen
If WRFVariableText is a Text, then it implements Writable, not Java serialization. You can implement Serializable in your class, or consider reusing SerializableWritable in Spark (note it's a developer API). On Thu, Feb 26, 2015 at 4:03 PM, patcharee patcharee.thong...@uni.no wrote: Hi, I am
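
A minimal sketch of the two options mentioned here, assuming a Text subclass like the one described in the original post (the class name mirrors the thread; the wrapping line is illustrative):

  import org.apache.hadoop.io.Text
  import org.apache.spark.SerializableWritable

  // Option 1: have the custom record type implement java.io.Serializable itself.
  class WRFVariableText extends Text with java.io.Serializable

  // Option 2: wrap each Writable in Spark's SerializableWritable (a developer API) where needed, e.g.:
  // rdd.map(v => new SerializableWritable(v))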

Cartesian issue with user defined objects

2015-02-26 Thread mrk91
Hello, I have an issue with the cartesian method. When I use it with the Java types everything is ok, but when I use it with an RDD made of objects defined by me it has very strange behaviors which depend on whether the RDD is cached or not (you can see here

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Can anyone share any thoughts related to my questions? Thanks From: java8...@hotmail.com To: user@spark.apache.org Subject: Help me understand the partition, parallelism in Spark Date: Wed, 25 Feb 2015 21:58:55 -0500 Hi, Sparkers: I come from the Hadoop MapReduce world and am trying to understand

Re: Creating hive table on spark ((ERROR))

2015-02-26 Thread sandeep vura
Oh, thanks for the clarification. I will try to downgrade Hive. On Thu, Feb 26, 2015 at 9:44 PM, Cheng Lian lian.cs@gmail.com wrote: You are using a Hive version which is not supported by Spark SQL. Spark SQL 1.1.x and prior versions only support Hive 0.12.0. Spark SQL 1.2.0 supports Hive

custom inputformat serializable problem

2015-02-26 Thread patcharee
Hi, I am using custom inputformat and recordreader. This custom recordreader has declaration: public class NetCDFRecordReader extends RecordReader<WRFIndex, WRFVariableText> The WRFVariableText extends Text: public class WRFVariableText extends org.apache.hadoop.io.Text The WRFVariableText

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Sean Owen
I believe that's right, and is what I was getting at. yes the implicit formulation ends up implicitly including every possible interaction in its loss function, even unobserved ones. That could be the difference. This is mostly an academic question though. In practice, you have click-like data

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread Sean Owen
Yea we discussed this on the list a short while ago. The extra overhead of count() is pretty minimal. Still you could wrap this up as a utility method. There was even a proposal to add some 'materialize' method to RDD. PS you can make your Java a little less verbose by omitting throws Exception

Job submission via multiple threads

2015-02-26 Thread Kartheek.R
Hi, I just wrote an application that intends to submit its actions (jobs) via independent threads, keeping in view the point: Second, within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads, mentioned in:
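
A minimal sketch of that pattern, assuming two independent actions submitted from separate driver-side threads against one SparkContext (all names and datasets are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}
  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  object ConcurrentActions {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("concurrent-actions").setMaster("local[4]"))
      val a = sc.parallelize(1 to 1000000)
      val b = sc.parallelize(1 to 1000000)

      // Each Future runs on its own driver-side thread, so the two count() jobs
      // can be scheduled concurrently by the same SparkContext.
      val jobA = Future { a.map(_ * 2).count() }
      val jobB = Future { b.filter(_ % 3 == 0).count() }

      println(Await.result(jobA, Duration.Inf))
      println(Await.result(jobB, Duration.Inf))
      sc.stop()
    }
  }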

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread Emre Sevinc
On Thu, Feb 26, 2015 at 4:20 PM, Sean Owen so...@cloudera.com wrote: Yea we discussed this on the list a short while ago. The extra overhead of count() is pretty minimal. Still you could wrap this up as a utility method. There was even a proposal to add some 'materialize' method to RDD. I

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread 163
oh my god, I think I understood... In my case, there are three kinds of user-item pairs: display-and-click pairs (positive pairs), display-but-no-click pairs (negative pairs), and no-display pairs (unobserved pairs). Explicit ALS only considers the first and second kinds, but implicit ALS considers all the

Re: Considering Spark for large data elements

2015-02-26 Thread Jeffrey Jedele
Hi Rob, I fear your questions will be hard to answer without additional information about what kind of simulations you plan to do. int[r][c] basically means you have a matrix of integers? You could for example map this to a row-oriented RDD of integer-arrays or to a column oriented RDD of integer

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread Emre Sevinc
Hello Sean, Thank you for your advice. Based on your suggestion, I've modified the code into the following (and once again admired the easy (!) verbosity of Java compared to 'complex and hard to understand' brevity (!) of Scala): javaDStream.foreachRDD( new Function<JavaRDD<String>,

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-26 Thread Patrick Varilly
By the way, the limitation of case classes to 22 parameters was removed in Scala 2.11 (https://issues.scala-lang.org/browse/SI-7296, https://issues.scala-lang.org/browse/SI-7098); there's some technical rough edge past 22 (https://github.com/scala/scala/pull/2305) that you most likely will never run

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread 163
Thank you very much for your opinion :) In our case, maybe it's dangerous to treat an un-observed item as a negative interaction (although we could give them small confidence, I think they are still not credible...). I will do more experiments and give you feedback :) Thank you ;) On

different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
I’m using ALS with Spark 1.0.0; the code should be: https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala I think the following two methods should produce the same (or nearly the same) result: MatrixFactorizationModel model =

Re: Creating hive table on spark ((ERROR))

2015-02-26 Thread Cheng Lian
Seems that you are running the Hive metastore over MySQL, but don’t have the MySQL JDBC driver on the classpath: Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver (“com.mysql.jdbc.Driver”) was not found in the CLASSPATH.

Spark Drill 1.2.1 - error

2015-02-26 Thread Jahagirdar, Madhu
All, We are getting the below error when we are using Drill JDBC driver with spark, please let us know what could be the issue. java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf at

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
okay, I have brought this to the user@ list. I don’t think the negative pairs should be omitted….. if the scores of all of the pairs are 1.0, the result will be worse… I have tried… Best Regards, Sendong Li On Feb 26, 2015, at 10:07 PM, Sean Owen so...@cloudera.com wrote: Yes, I mean, do not generate a

Stand-alone Spark on windows

2015-02-26 Thread Sergey Gerasimov
Hi! I downloaded the Spark binaries, unpacked them, and could successfully run the pyspark shell and write and execute some code there, BUT I failed to submit stand-alone Python scripts or jar files via spark-submit: spark-submit pi.py I always get an exception stack trace with NullPointerException in

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread Emre Sevinc
Francois, Thank you for quickly verifying. Kind regards, Emre Sevinç On Thu, Feb 26, 2015 at 2:32 PM, francois.garil...@typesafe.com wrote: The short answer: count(), as the sum can be partially aggregated on the mappers. The long answer:

[ANNOUNCE] Apache MRQL 0.9.4-incubating released

2015-02-26 Thread Leonidas Fegaras
The Apache MRQL team is pleased to announce the release of Apache MRQL 0.9.4-incubating. This is our second Apache release. Apache MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, Spark, and Flink. The release

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Paweł Szulc
Correct me if I'm wrong, but he can actually run this code without broadcasting the users map; however, the code will be less efficient. On Thu, 26 Feb 2015, 12:31 PM, Sean Owen so...@cloudera.com wrote: Yes, but there is no concept of executors 'deleting' an RDD. And you would want

Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread Emre Sevinc
Hello, I have a piece of code to force the materialization of RDDs in my Spark Streaming program, and I'm trying to understand which method is faster and has less memory consumption: javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() { @Override public Void call(JavaRDD<String>

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
Yes that's correct; it works but broadcasting would be more efficient. On Thu, Feb 26, 2015 at 1:20 PM, Paweł Szulc paul.sz...@gmail.com wrote: Correct me if I'm wrong, but he can actually run thus code without broadcasting the users map, however the code will be less efficient. czw., 26

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Sean Owen
+user On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote: I think I may have it backwards, and that you are correct to keep the 0 elements in train() in order to try to reproduce the same result. The second formulation is called 'weighted regularization' and is used for

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Xiangrui Meng
Lisen, did you use all m-by-n pairs during training? Implicit model penalizes unobserved ratings, while explicit model doesn't. -Xiangrui On Feb 26, 2015 6:26 AM, Sean Owen so...@cloudera.com wrote: +user On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote: I think I may
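
A minimal sketch of the two training paths being compared in this thread, assuming MLlib's ALS, an existing SparkContext sc, and a toy ratings RDD (the rank, iteration, lambda and alpha values are illustrative):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // ratings: RDD[Rating] -- for implicit data the "rating" is a click count / confidence value.
  val ratings = sc.parallelize(Seq(Rating(1, 10, 1.0), Rating(1, 11, 0.0), Rating(2, 10, 1.0)))

  // Explicit ALS: only the (user, item) pairs present in `ratings` enter the loss function.
  val explicitModel = ALS.train(ratings, 10, 10, 0.01)

  // Implicit ALS: values are treated as confidence, and every (user, item) pair --
  // including unobserved ones -- is implicitly penalized toward a preference of 0.
  val implicitModel = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)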

Apache Ignite vs Apache Spark

2015-02-26 Thread Ognen Duzlevski
Can someone with experience briefly share or summarize the differences between Ignite and Spark? Are they complementary? Totally unrelated? Overlapping? Seems like ignite has reached version 1.0, I have never heard of it until a few days ago and given what is advertised, it sounds pretty

Re: throughput in the web console?

2015-02-26 Thread Saiph Kappa
One more question: while processing the exact same batch I noticed that giving more CPUs to the worker does not decrease the duration of the batch. I tried this with 4 and 8 CPUs. However, I noticed that with only 1 CPU the duration increased, but apart from that the values were pretty similar,

Re: Which OutputCommitter to use for S3?

2015-02-26 Thread Thomas Demoor
FYI. We're currently addressing this at the Hadoop level in https://issues.apache.org/jira/browse/HADOOP-9565 Thomas Demoor On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: Just to close the loop in case anyone runs into the same problem I had. By setting

Getting to proto buff classes in Spark Context

2015-02-26 Thread necro351 .
Hello everyone, We are trying to decode a message inside a Spark job that we receive from Kafka. The message is encoded using Proto Buff. The problem is when decoding we get class-not-found exceptions. We have tried remedies we found online in Stack Exchange and mail list archives but nothing

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-26 Thread Christophe Préaud
You can see this information in the yarn web UI using the configuration I provided in my former mail (click on the application id, then on logs; you will then be automatically redirected to the yarn history server UI). On 24/02/2015 19:49, Colin Kincaid Williams wrote: So back to my original

Re: throughput in the web console?

2015-02-26 Thread Saiph Kappa
By setting spark.eventLog.enabled to true it is possible to see the application UI after the application has finished its execution, however the Streaming tab is no longer visible. For measuring the duration of batches in the code I am doing something like this: wordCharValues.foreachRDD(rdd => {

Re: Spark excludes fastutil dependencies we need

2015-02-26 Thread Marcelo Vanzin
On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com wrote: So, should the userClassPathFirst flag work and there is a bug? Sorry for jumping in the middle of conversation (and probably missing some of it), but note that this option applies only to executors. If you're trying to

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Sean Owen
Ignite is the renaming of GridGain, if that helps. It's like Oracle Coherence, if that helps. These do share some similarities -- fault tolerant, in-memory, distributed processing. The pieces they're built on differ, the architecture differs, the APIs differ. So fairly different in particulars. I

Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Mike Trienis
Hi All, I have Spark Streaming setup to write data to a replicated MongoDB database and would like to understand if there would be any issues using the Reactive Mongo library to write directly to the mongoDB? My stack is Apache Spark sitting on top of Cassandra for the datastore, so my thinking

Iterating on RDDs

2015-02-26 Thread Vijayasarathy Kannan
Hi, I have the following use case. (1) I have an RDD of edges of a graph (say R). (2) do a groupBy on R (by say source vertex) and call a function F on each group. (3) collect the results from Fs and do some computation (4) repeat the above steps until some criteria is met In (2), the groups

spark streaming, batchinterval,windowinterval and window sliding interval difference

2015-02-26 Thread Hafiz Mujadid
Can somebody explain the difference between batch interval, window interval, and window sliding interval, with an example? Is there any real-time use case for these parameters? Thanks -- View this message in context:
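
A minimal sketch showing where the three intervals appear in the Streaming API, assuming a socket text stream (the host, port and durations are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
  // Batch interval: every 2 seconds the received data is cut into one RDD (one batch).
  val ssc = new StreamingContext(conf, Seconds(2))

  val lines = ssc.socketTextStream("localhost", 9999)

  // Window interval (30s): each windowed result covers the last 30 seconds of batches.
  // Sliding interval (10s): a new windowed result is produced every 10 seconds.
  val windowedCounts = lines.flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

  windowedCounts.print()
  ssc.start()
  ssc.awaitTermination()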

Re: Creating hive table on spark ((ERROR))

2015-02-26 Thread sandeep vura
Hi Cheng, Thanks, the above issue has been resolved. I have configured a remote metastore (not a local metastore) in Hive. While creating a table in Spark SQL, another error appears on the terminal. The error is given below: sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep_data/sales_pg.csv'

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-26 Thread Mukesh Jha
On Wed, Feb 25, 2015 at 8:09 PM, Mukesh Jha me.mukesh@gmail.com wrote: My application runs fine for ~3/4 hours and then hits this issue. On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hi Experts, My Spark Job is failing with below error. From the logs I

How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread anishm
I am a beginner to the world of Machine Learning and the usage of Apache Spark. I have followed the tutorial at https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html#augmenting-matrix-factors

Re: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

2015-02-26 Thread Deepak Vohra
Sean, Would you kindly suggest on which forum, mailing list, or issue tracker to ask questions about the AAS book? Or is no such provision made? regards, Deepak On Thu, 2/26/15, Sean Owen so...@cloudera.com wrote: Subject: Re: value foreach is not a member

Re: Getting to proto buff classes in Spark Context

2015-02-26 Thread Akshat Aranya
My guess would be that you are packaging too many things in your job, which is causing problems with the classpath. When your jar goes in first, you get the correct version of protobuf, but some other version of something else. When your jar goes in later, other things work, but protobuf breaks.

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-26 Thread Imran Rashid
Hi Tristan, at first I thought you were just hitting another instance of https://issues.apache.org/jira/browse/SPARK-1391, but I actually think it's entirely related to Kryo. Would it be possible for you to try serializing your object using Kryo, without involving Spark at all? If you are

Augment more data to existing MatrixFactorization Model?

2015-02-26 Thread anishm
I am a beginner to the world of Machine Learning and the usage of Apache Spark. I have followed the tutorial at https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html#augmenting-matrix-factors

Re: how to map and filter in one step?

2015-02-26 Thread Sean Owen
You can flatMap: rdd.flatMap { in => if (condition(in)) { Some(transformation(in)) } else { None } } On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I have a text file input and I want to parse line by line and map each line to another format. But
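
A small runnable version of the same idea, assuming an existing SparkContext sc and a simple integer parse as the per-line work (the parsing logic is illustrative):

  val lines = sc.parallelize(Seq("1", "two", "3", "", "4"))

  // flatMap does the filter and the map in one pass: lines that fail the
  // condition map to None and simply disappear from the output.
  val parsed = lines.flatMap { in =>
    if (in.nonEmpty && in.forall(_.isDigit)) Some(in.toInt) else None
  }

  parsed.collect().foreach(println)  // prints 1, 3, 4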

value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

2015-02-26 Thread Deepak Vohra
Ch 6 listing from Advanced Analytics with Spark generates an error. The listing is: def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = { val doc = new Annotation(text) pipeline.annotate(doc) val lemmas = new ArrayBuffer[String]() val

Re: how to map and filter in one step?

2015-02-26 Thread Crystal Xing
I see. So the reason we can use flatMap, but not map, to map an element to nothing is that flatMap supports mapping to zero or more elements while map only supports a 1-to-1 mapping? It seems flatMap is more equivalent to Hadoop's map. Thanks, Zheng zhen On Thu, Feb 26, 2015 at 10:44 AM, Sean Owen

how to map and filter in one step?

2015-02-26 Thread Crystal Xing
Hi, I have a text file input and I want to parse it line by line and map each line to another format. But at the same time, I want to filter out some lines I do not need. I wonder if there is a way to filter out those lines in the map function. Do I have to do it in two steps, filter and map? In that

Re: throughput in the web console?

2015-02-26 Thread Tathagata Das
If you have one receiver, and you are doing only map-like operations, then the processing will primarily happen on one machine. To use all the machines, either receive in parallel with multiple receivers, or spread out the computation by explicitly repartitioning the received streams
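
A minimal sketch of the two options described here, assuming socket receivers (the hosts, ports and counts are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("receiver-parallelism-sketch")
  val ssc = new StreamingContext(conf, Seconds(2))

  // Option 1: several receivers running in parallel, unioned into one stream.
  val streams = (0 until 4).map(i => ssc.socketTextStream("localhost", 9000 + i))
  val unified = ssc.union(streams)

  // Option 2: a single receiver, with the received blocks redistributed across
  // the cluster before the map-like work runs.
  val spread = ssc.socketTextStream("localhost", 9999).repartition(8)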

NullPointerException in TaskSetManager

2015-02-26 Thread gtinside
Hi , I am trying to run a simple hadoop job (that uses CassandraHadoopInputOutputWriter) on spark (v1.2 , Hadoop v 1.x) but getting NullPointerException in TaskSetManager WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost task 14.2 in stage 0.0 (TID 29,

Re: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

2015-02-26 Thread Sean Owen
(Books on Spark are not produced by the Spark project, and this is not the right place to ask about them. This question was already answered offline, too.) On Thu, Feb 26, 2015 at 6:38 PM, Deepak Vohra dvohr...@yahoo.com.invalid wrote: Ch 6 listing from Advanced Analytics with Spark generates

Re: how to map and filter in one step?

2015-02-26 Thread Mark Hamstra
rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be pipelined into a single stage, so there generally isn't any need to complect the map and filter into a single function. Additionally, there is RDD#collect[U](f: PartialFunction[T, U])(implicit arg0: ClassTag[U]): RDD[U],
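
A minimal sketch of both forms, assuming an existing SparkContext sc (the transformation and predicate are illustrative):

  val nums = sc.parallelize(1 to 10)

  // map and filter are pipelined into a single stage, so there is no extra pass over the data:
  val viaFilterMap = nums.filter(_ % 2 == 0).map(_ * 10)

  // RDD#collect(PartialFunction) expresses the filter and the map in one step:
  val viaCollect = nums.collect { case n if n % 2 == 0 => n * 10 }

  viaCollect.collect().foreach(println)  // 20, 40, 60, 80, 100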

Missing tasks

2015-02-26 Thread Akshat Aranya
I am seeing a problem with a Spark job in standalone mode. Spark master's web interface shows a task RUNNING on a particular executor, but the logs of the executor do not show the task being ever assigned to it, that is, such a line is missing from the log: 15/02/25 16:53:36 INFO

Re: GroupByKey causing problem

2015-02-26 Thread Imran Rashid
Hi Tushar, The most scalable option is probably for you to consider doing some approximation. E.g., sample the data first to come up with the bucket boundaries. Then you can assign data points to buckets without needing to do a full groupByKey. You could even have more passes which correct any
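
A rough sketch of that approximation, assuming the data is already keyed as (columnIndex, value) pairs, an existing SparkContext sc, and 10 buckets per column (all names and the sampling fraction are illustrative):

  // colValues: RDD[(Int, Double)] -- (columnIndex, value)
  val numBuckets = 10

  // 1. Estimate per-column bucket boundaries from a small sample on the driver.
  val boundaries: Map[Int, Array[Double]] =
    colValues.sample(withReplacement = false, fraction = 0.01)
      .groupByKey()  // the sample is small, so this groupByKey is cheap
      .mapValues { vs =>
        val sorted = vs.toArray.sorted
        (1 until numBuckets).map(i => sorted((i * sorted.length) / numBuckets)).toArray
      }
      .collectAsMap().toMap

  val bcBoundaries = sc.broadcast(boundaries)

  // 2. Assign every value to a bucket with a local lookup -- no full groupByKey needed.
  val binned = colValues.map { case (col, v) =>
    val bs = bcBoundaries.value(col)
    val bucket = bs.indexWhere(v < _) match { case -1 => bs.length; case i => i }
    (col, (bucket, v))
  }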

[SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yana Kadiyska
Can someone confirm if they can run UDFs in group by in spark1.2? I have two builds running -- one from a custom build from early December (commit 4259ca8dd12) which works fine, and Spark1.2-RC2. On the latter I get: jdbc:hive2://XXX.208:10001 select

Re: Cartesian issue with user defined objects

2015-02-26 Thread Imran Rashid
any chance your input RDD is being read from hdfs, and you are running into this issue (in the docs on SparkContext#hadoopFile): * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD or directly passing it to an
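
If that is indeed the cause, a minimal sketch of the usual workaround is to copy each reused Writable into an immutable value before caching or calling cartesian() (the path and types are illustrative, with an existing SparkContext sc assumed):

  import org.apache.hadoop.io.Text

  val raw = sc.sequenceFile("hdfs:///data/input", classOf[Text], classOf[Text])

  // The RecordReader reuses the same Text instances for every record, so copy the
  // contents out (here, to plain Strings) before caching or pairing records up.
  val safe = raw.map { case (k, v) => (k.toString, v.toString) }.cache()

  val pairs = safe.cartesian(safe)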

How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1 = sparkContext.sequenceFile().cache() val rdd2 = rdd1.map() How would I tell programmatically without being the one who built rdd1 and rdd2 whether or not rdd2

Re: Iterating on RDDs

2015-02-26 Thread Imran Rashid
val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER) // or whatever persistence makes more sense for you ... while(true) { val res = grouped.flatMap(F) res.collect.foreach(func) if(criteria) break } On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I see the rdd.dependencies() function, does that include ALL the dependencies of an RDD? Is it safe to assume I can say rdd2.dependencies.contains(rdd1)? On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Okay I confirmed my suspicions of a hang. I made a request that stopped progressing, though the already-scheduled tasks had finished. I made a separate request that was small enough not to hang, and it kicked the hung job enough to finish. I think what's happening is that the scheduler or the

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Imran Rashid
Hi Yong, mostly correct except for: "Since we are doing reduceByKey, shuffling will happen. Data will be shuffled into 1000 partitions, as we have 1000 unique keys." No, you will not get 1000 partitions. Spark has to decide how many partitions to use before it even knows how many
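
A minimal sketch of what actually controls the post-shuffle partition count, assuming an existing SparkContext sc (the key space and the number 200 are illustrative):

  val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1))

  // With no argument, reduceByKey uses the default parallelism / the parent
  // partitioning -- not the number of distinct keys (1000 here).
  val defaultPartitioned = pairs.reduceByKey(_ + _)

  // The post-shuffle partition count can be set explicitly:
  val explicitPartitioned = pairs.reduceByKey(_ + _, 200)

  println(defaultPartitioned.partitions.length)
  println(explicitPartitioned.partitions.length)  // 200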

Running spark function on parquet without sql

2015-02-26 Thread tridib
Hello Experts, In one of my projects we have parquet files and we are using Spark SQL to get our analytics. I am encountering situations where a simple SQL query does not get me what I need, or the complex SQL is not supported by Spark SQL. In scenarios like this I am able to get things done using

Re: How to tell if one RDD depends on another

2015-02-26 Thread Imran Rashid
no, it does not give you transitive dependencies. You'd have to walk the tree of dependencies yourself, but that should just be a few lines. On Thu, Feb 26, 2015 at 3:32 PM, Corey Nolet cjno...@gmail.com wrote: I see the rdd.dependencies() function, does that include ALL the dependencies of
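
A rough sketch of that walk, assuming reference equality is enough to identify the ancestor RDD (the helper name is illustrative):

  import org.apache.spark.rdd.RDD

  // Recursively walk rdd.dependencies to check whether `ancestor` appears anywhere upstream.
  def dependsOn(rdd: RDD[_], ancestor: RDD[_]): Boolean =
    rdd.dependencies.exists { dep =>
      (dep.rdd eq ancestor) || dependsOn(dep.rdd, ancestor)
    }

  // dependsOn(rdd2, rdd1) is true when rdd2 = rdd1.map(...), including transitively.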

Issue with deploye Driver in cluster mode

2015-02-26 Thread pankaj
Hi, I have a 3-node Spark cluster: node1, node2 and node3. I am running the below command on node 1 for deploying the driver /usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.fst.firststep.aggregator.FirstStepMessageProcessor --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
The data is small. The job is composed of many small stages. * I found that with fewer than 222 the problem exhibits. What will be gained by going higher? * Pushing up the parallelism only pushes up the boundary at which the system appears to hang. I'm worried about some sort of message loss or

Re: spark streaming: stderr does not roll

2015-02-26 Thread Jeffrey Jedele
So to summarize (I had a similar question): Spark's log4j by default is configured to log to the console? Those messages end up in the stderr files and the approach does not support rolling? If I configure log4j to log to files, how can I keep the folder structure? Should I use relative paths

Re: spark streaming: stderr does not roll

2015-02-26 Thread Sean Owen
I think that's up to you. You can make it log wherever you want, and have some control over how log4j names the rolled log files by configuring its file-based rolling appender. On Thu, Feb 26, 2015 at 10:05 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: So the summarize (I had a similar

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Thanks for the link. Unfortunately, I turned on rdd compression and nothing changed. I tried switching from netty to nio and no change :( On Thu, Feb 26, 2015 at 2:01 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Not many that i know of, but i bumped into this one

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
No. That code is just Scala code executing on the driver. usersMap is a local object. This bit has nothing to do with Spark. Yes you would have to broadcast it to use it efficient in functions (not on the driver). On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz konstt2...@gmail.com wrote: So,

How to read from hdfs using spark-shell in Intel hadoop?

2015-02-26 Thread MEETHU MATHEW
Hi, I am not able to read from HDFS (Intel distribution Hadoop, Hadoop version is 1.0.3) from spark-shell (Spark version is 1.2.1). I built Spark using the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and read an HDFS file using sc.textFile(), and the exception is WARN

Re: Executor lost with too many temp files

2015-02-26 Thread Marius Soutier
Yeah did that already (65k). We also disabled swapping and reduced the amount of memory allocated to Spark (available - 4). This seems to have resolved the situation. Thanks! On 26.02.2015, at 05:43, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Can you try increasing the ulimit

GroupByKey causing problem

2015-02-26 Thread Tushar Sharma
Hi, I am trying to apply binning to a large CSV dataset. Here are the steps I am taking: 1. Emit each value of CSV as (ColIndex,(RowIndex,value)) 2. Then I groupByKey (here ColumnIndex) and get all values of a particular index to one node, as I have to work on the collection of all values 3. I

Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-26 Thread Cheng Lian
Could you check the Spark web UI for the number of tasks issued when the query is executed? I dug out mapred.map.tasks because I saw 2 tasks were issued. On 2/26/15 3:01 AM, Kannan Rajah wrote: Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0. -- Kannan

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Is there any potential problem from 1.1.1 to 1.2.1 with shuffle dependencies that produce no data? On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen v...@paxata.com wrote: The data is small. The job is composed of many small stages. * I found that with fewer than 222 the problem exhibits.

CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
I have a question, If I execute this code, val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(v => (v(0), v(1))) val contacts = sc.textFile("/tmp/contacts.log").map(y => y.split(",")).map(v => (v(0), v(1))) val usersMap = contacts.collectAsMap() contacts.map(v => (v._1, (usersMap(v._1),

Re: Scheduler hang?

2015-02-26 Thread Akhil Das
Not many that i know of, but i bumped into this one https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen v...@paxata.com wrote: Is there any potential problem from 1.1.1 to 1.2.1 with shuffle dependencies that produce no

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
No, it exists only on the driver, not the executors. Executors don't retain partitions unless they are supposed to be persisted. Generally, broadcasting a small Map to accomplish a join 'manually' is more efficient than a join, but you are right that this is mostly because joins usually involve

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
So, in my example, when I execute: val usersMap = contacts.collectAsMap() -- the map goes to the driver and just lives there in the beginning. contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect When I execute usersMap(v._1), does the driver have to send to the executorX the value which it needs? I

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
Isn't contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() executed on the executors? Why is it executed on the driver? contacts is not a local object, right? 2015-02-26 11:27 GMT+01:00 Sean Owen so...@cloudera.com: No. That code is just Scala code executing on the driver. usersMap

Re: How to efficiently control concurrent Spark jobs

2015-02-26 Thread Jeffrey Jedele
So basically you have lots of small ML tasks you want to run concurrently? By "I've used repartition and cache to store the sub-datasets on only one machine", do you mean that you reduced each RDD to have only one partition? Maybe you want to give the fair scheduler a try to get more of your tasks
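
A minimal sketch of enabling the fair scheduler mentioned here (the pool name is illustrative and would normally be defined in a fair-scheduler allocation file):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("many-small-ml-jobs")
    .set("spark.scheduler.mode", "FAIR")  // default is FIFO
  val sc = new SparkContext(conf)

  // Jobs submitted from this thread are assigned to a named pool.
  sc.setLocalProperty("spark.scheduler.pool", "mlPool")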

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-26 Thread anamika gupta
Hi Patrick Thanks a ton for your in-depth answer. The compilation error is now resolved. Thanks a lot again !! On Thu, Feb 26, 2015 at 2:40 PM, Patrick Varilly patrick.vari...@dataminded.be wrote: Hi, Akhil, In your definition of sdp_d

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
Yes, in that code, usersMap has been serialized to every executor. I thought you were referring to accessing the copy in the driver. On Thu, Feb 26, 2015 at 10:47 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Isn't it contacts.map(v = (v._1, (usersMap(v._1), v._2))).collect() executed in the

Re: Number of Executors per worker process

2015-02-26 Thread Jeffrey Jedele
Hi Spico, Yes, I think an executor core in Spark is basically a thread in a worker pool. It's recommended to have one executor core per physical core on your machine for best performance, but I think in theory you can create as many threads as your OS allows. For deployment: There seems to be

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
One last time to be sure I got it right, the executing sequence here goes like this?: val usersMap = contacts.collectAsMap() #The contacts RDD is collected by the executors and sent to the driver, the executors delete the rdd contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() #The userMap

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
Yes, but there is no concept of executors 'deleting' an RDD. And you would want to broadcast the usersMap if you're using it this way. On Thu, Feb 26, 2015 at 11:26 AM, Guillermo Ortiz konstt2...@gmail.com wrote: One last time to be sure I got it right, the executing sequence here goes like
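
A minimal sketch of the broadcast variant of the code from this thread, keeping the original file paths (the variable names follow Guillermo's example, with an existing SparkContext sc assumed):

  val users = sc.textFile("/tmp/users.log").map(_.split(",")).map(v => (v(0), v(1)))
  val contacts = sc.textFile("/tmp/contacts.log").map(_.split(",")).map(v => (v(0), v(1)))

  // Collect the small side once and broadcast it, so each executor receives one
  // copy instead of a copy serialized into every task closure.
  val usersMap = contacts.collectAsMap()
  val bcUsersMap = sc.broadcast(usersMap)

  val result = contacts.map(v => (v._1, (bcUsersMap.value(v._1), v._2))).collect()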

RE: Spark cluster set up on EC2 customization

2015-02-26 Thread Sameer Tilak
Thanks! Date: Thu, 26 Feb 2015 12:51:21 +0530 Subject: Re: Spark cluster set up on EC2 customization From: ak...@sigmoidanalytics.com To: ssti...@live.com CC: user@spark.apache.org You can easily add a function (say setup_pig) inside the function setup_cluster in this script. Thanks Best Regards

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-26 Thread bit1...@163.com
Sure, Thanks Tathagata! bit1...@163.com From: Tathagata Das Date: 2015-02-26 14:47 To: bit1...@163.com CC: Akhil Das; user Subject: Re: Re: Many Receiver vs. Many threads per Receiver Spark Streaming has a new Kafka direct stream, to be release as experimental feature with 1.3. That uses a
