Storage Locations of an rdd

2014-12-26 Thread rapelly kartheek
Hi, I need to find the storage locations (node IDs) of each partition of a replicated RDD in Spark. I mean, if an RDD is replicated twice, I want to find the two nodes on which each partition is stored. The Spark WebUI has a page that depicts the data distribution of each RDD. But, I
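
A minimal sketch of one way to get this from user code, assuming the SparkContext.getExecutorStorageStatus developer API in Spark 1.x (it is marked @DeveloperApi and may change); the RDD and storage level are illustrative:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000, 4).persist(StorageLevel.MEMORY_ONLY_2)  // replicated twice
    rdd.count()  // materialize so the replicated blocks actually exist

    // For every executor, list which blocks of this RDD it holds; each
    // partition of a twice-replicated RDD should show up on two executors.
    sc.getExecutorStorageStatus.foreach { status =>
      val blocks = status.rddBlocksById(rdd.id).keys
      println(status.blockManagerId.hostPort + " holds " + blocks.mkString(", "))
    }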

Re: Discourse: A proposed alternative to the Spark User list

2014-12-26 Thread Nicholas Chammas
Thanks for providing that additional background, Josh. It looks like many people on that Google Groups thread wanted a better interface than is offered by the Apache mailing lists. Some even raised the idea of a bi-directional bridge

RE: unable to do group by with 1st column

2014-12-26 Thread Sean Owen
This does not appear to be what the asker wanted, as it makes one big string. groupByKey is correct after parsing to key-value pairs. On Dec 26, 2014 3:55 AM, Somnath Pandeya somnath_pand...@infosys.com wrote: Hi, you can try reduceByKey also, something like this: JavaPairRDD<String,
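
A minimal sketch of the pattern Sean describes, assuming an illustrative comma-separated line format with the key in the first column:

    val lines = sc.textFile("/data/data.csv")          // illustrative path
    val pairs = lines.map { line =>
      val fields = line.split(",")
      (fields(0), fields.drop(1).mkString(","))        // (first column, rest of the row)
    }
    val grouped = pairs.groupByKey()                   // all values collected per key
    // If a per-key aggregate is all you need, reduceByKey is cheaper:
    // val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _)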

Re: Discourse: A proposed alternative to the Spark User list

2014-12-26 Thread Sean Owen
I like the idea and the hope that it turns 2+ places for discussions into 1, but in practice I think it will just turn it into 3+. The only thing I can imagine is making a tool like this an overlay. Does that require much integration work and does it affect anyone who can't use it? People won't

Re: unable to do group by with 1st column

2014-12-26 Thread Amit Behera
Hi, Thank you very much to all for your replies. I am able to get it with groupByKey. Here is my code: import au.com.bytecode.opencsv.CSVParser val data = sc.textFile("/data/data.csv"); def pLines(lines: Iterator[String]) = { val parser = new CSVParser() lines.map(l => { val vs = parser.parseLine(l)
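
Amit's snippet is cut off above; a hedged completion of the same approach (an opencsv CSVParser created once per partition, the first column used as the key) might look like this:

    import au.com.bytecode.opencsv.CSVParser

    val data = sc.textFile("/data/data.csv")
    def pLines(lines: Iterator[String]) = {
      val parser = new CSVParser()                     // one parser per partition
      lines.map { l =>
        val vs = parser.parseLine(l)
        (vs(0), vs.drop(1).toSeq)                      // (first column, remaining columns)
      }
    }
    val grouped = data.mapPartitions(pLines).groupByKey()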

Serious issues with class not found exceptions of classes in uber jar

2014-12-26 Thread critikaled
Hi, I'm facing serious issues with my Spark application not recognizing the classes in the uber jar; sometimes it recognizes them and sometimes it does not. Even adding external jars using setJars does not always help. Is anyone else facing a similar issue? I'm using the latest 1.2.0 version.

Spark Streaming and Windows, it always counts the logs during all the windows. Why?

2014-12-26 Thread Guillermo Ortiz
I'm trying to make some operations with windows and intervals. I get data every 15 seconds, and want to have a window of 60 seconds with batch intervals of 15 seconds. I'm injecting data with ncat. If I inject 3 logs in the same interval, I get into the do-something branch every 15 seconds during one
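
For reference, a minimal sketch of a 60-second window sliding every 15 seconds over 15-second batches (the socket source and the counting are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(15))       // 15-second batch interval
    val lines = ssc.socketTextStream("localhost", 9999)   // e.g. data injected with ncat
    // Every 15 seconds, count everything received during the last 60 seconds.
    val counts = lines.window(Seconds(60), Seconds(15)).count()
    counts.print()
    ssc.start()
    ssc.awaitTermination()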

Re: Serious issues with class not found exceptions of classes in uber jar

2014-12-26 Thread Akhil Das
Instead of setJars, you could try addJar and see if the issue still exists. Thanks Best Regards On Fri, Dec 26, 2014 at 3:26 PM, critikaled isasmani@gmail.com wrote: Hi, I'm facing serious issues with my Spark application not recognizing the classes in the uber jar; sometimes it recognizes some
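
A small sketch of the two approaches being compared (the jar path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setJars(Seq("/path/to/my-app-assembly.jar"))   // ship the uber jar at context creation
    val sc = new SparkContext(conf)

    // Alternatively, as suggested above, add the jar after the context exists:
    sc.addJar("/path/to/my-app-assembly.jar")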

Re: Serious issues with class not found exceptions of classes in uber jar

2014-12-26 Thread critikaled
Will this output from stderr help? Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/12/26 10:13:44 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/12/26 10:13:44 WARN NativeCodeLoader: Unable to load native-hadoop library

Re: serialization issue with mapPartitions

2014-12-26 Thread Akhil
You cannot pass your jobConf object inside any of the transformation functions in Spark (like map, mapPartitions, etc.) since org.apache.hadoop.mapreduce.Job is not Serializable. You can use KryoSerializer (see this doc: http://spark.apache.org/docs/latest/tuning.html#data-serialization). We
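
A minimal sketch of the usual workaround: build the non-serializable Hadoop object inside mapPartitions so it is constructed on each executor instead of being shipped from the driver (the record-processing logic is illustrative):

    import org.apache.hadoop.conf.Configuration

    val rdd = sc.parallelize(Seq("a", "b", "c"))
    val result = rdd.mapPartitions { iter =>
      // Created once per partition on the executor, so it never has to be serialized.
      val hadoopConf = new Configuration()
      iter.map(record => (record, hadoopConf.get("fs.defaultFS")))
    }
    result.collect().foreach(println)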

Re: Spark Streaming and Windows, it always counts the logs during all the windows. Why?

2014-12-26 Thread Guillermo Ortiz
I'm trying to understand why it's not working, and I added some println calls to check what the code was executing. def ruleSqlInjection(lines: ReceiverInputDStream[String]) = { println(1); // Just one time, when I start the program val filterSql = lines.filter(line =

Re: Spark Streaming and Windows, it always counts the logs during all the windows. Why?

2014-12-26 Thread Guillermo Ortiz
Oh, I didn't understand what I was doing, my fault (too many parties this Christmas). I thought windows worked in another, weird way. Sorry for the questions. 2014-12-26 13:42 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com: I'm trying to understand why it's not working, and I added some println calls to

Re: Escape commas in file names

2014-12-26 Thread Daniel Siegmann
Thanks for the replies. Hopefully this will not be too difficult to fix. Why not support multiple paths by overloading the parquetFile method to take a collection of strings? That way we don't need an appropriate delimiter. On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao hao.ch...@intel.com wrote:
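
Until such an overload exists, one hedged workaround is to load each path separately and union the results, which sidesteps the comma-delimited path string entirely (paths are illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val paths = Seq("/data/file,with,commas.parquet", "/data/other.parquet")
    // One SchemaRDD per path, then a single union over all of them.
    val combined = paths.map(sqlContext.parquetFile).reduce(_ unionAll _)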

Re: unable to do group by with 1st column

2014-12-26 Thread Michael Albert
Greetings! I'm trying to do something similar, and having a very bad time of it. What I start with is key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4, ...), key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...). What I want (what I have been asked to produce

Re: unable to do group by with 1st column

2014-12-26 Thread Sean Owen
Here is a sketch of what you need to do, off the top of my head and based on a guess of what your RDD is like: val in: RDD[(K, Seq[(C, V)])] = ... in.flatMap { case (key, colVals) => colVals.map { case (col, value) => (col, (key, value)) } }.groupByKey. So the problem with both input and output
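
Filling in that sketch as a runnable example, with illustrative sample data of type RDD[(String, Seq[(String, String)])]:

    import org.apache.spark.rdd.RDD

    val in: RDD[(String, Seq[(String, String)])] = sc.parallelize(Seq(
      ("key1", Seq(("col1", "val-1-1"), ("col2", "val-1-2"))),
      ("key2", Seq(("col1", "val-2-1"), ("col2", "val-2-2")))))

    // Re-key every (column, value) pair by column, then gather the
    // (key, value) pairs that belong to each column.
    val byColumn = in.flatMap { case (key, colVals) =>
      colVals.map { case (col, value) => (col, (key, value)) }
    }.groupByKey()

    byColumn.collect().foreach(println)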

Re: Using the DataStax Cassandra Connector from PySpark

2014-12-26 Thread Stephen Boesch
Did you receive any response on this? I am trying to load HBase classes and getting the same error: py4j.protocol.Py4JError: Trying to call a package. Even though $HBASE_HOME/lib/* had already been added to compute-classpath.sh. 2014-10-21 16:02 GMT-07:00 Mike Sukmanowsky

Re: how to do incremental model updates using spark streaming and mllib

2014-12-26 Thread Reza Zadeh
As of Spark 1.2 you can do streaming k-means; see examples here: http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1 Best, Reza On Fri, Dec 26, 2014 at 1:36 AM, vishnu johnfedrickena...@gmail.com wrote: Hi, Say I have created a clustering model using KMeans for 100 million
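
A condensed sketch along the lines of the linked example, assuming training vectors arrive as text files in an illustrative directory; k, the decay factor, and the vector dimension are placeholders:

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    val trainingData = ssc.textFileStream("/training/dir").map(Vectors.parse)

    val model = new StreamingKMeans()
      .setK(10)                  // number of clusters
      .setDecayFactor(1.0)       // 1.0 weights all data equally; < 1.0 forgets old batches
      .setRandomCenters(3, 0.0)  // 3-dimensional vectors, zero initial weight

    // Cluster centers are updated incrementally as each new batch arrives.
    model.trainOn(trainingData)
    ssc.start()
    ssc.awaitTermination()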

Re: How to build Spark against the latest

2014-12-26 Thread Ted Yu
In case JDK 1.7 or higher is used to build, --skip-java-test needs to be specified. FYI On Thu, Dec 25, 2014 at 5:03 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: The following command works: ./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive

Can't submit the SparkPi example to local Yarn 2.6.0 installed by ambari 1.7.0

2014-12-26 Thread guxiaobo1982
Hi, I built the 1.2.0 version of Spark against the single-node Hadoop 2.6.0 installed by Ambari 1.7.0. The ./bin/run-example SparkPi 10 command can execute on my local Mac 10.9.5 and on the CentOS virtual machine which hosts Hadoop, but I can't run the SparkPi example inside YARN; it seems there's

Compile error from Spark 1.2.0

2014-12-26 Thread Zigen Zigen
Hello, I am zigen. I am using Spark SQL 1.1.0 and I want to use Spark SQL 1.2.0, but my Spark application gets a compile error. Spark 1.1.0 had DataType.DecimalType, but Spark 1.2.0 does not have DataType.DecimalType. Why? JavaDoc (Spark 1.1.0)