Is there a way to clone a JavaRDD without persisting it

2014-11-11 Thread Steve Lewis
In my problem I have a number of intermediate JavaRDDs and would like to be able to look at their sizes without destroying the RDD for subsequent processing. persist will do this, but these are big and persist seems expensive, and I am unsure of which StorageLevel is needed. Is there a way to
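A minimal sketch of one common pattern for this (in Scala rather than the Java API; the input path, names, and StorageLevel choice are assumptions, not from the thread): persist, read the count, then unpersist so the cached blocks are freed while the lineage stays usable.

  import org.apache.spark.storage.StorageLevel

  // Hypothetical intermediate RDD. Note that count() alone does not
  // "destroy" an RDD; it only triggers a (re)computation of its lineage.
  val intermediate = sc.textFile("hdfs:///data/input").map(_.length)

  // Persist so the count and the later stages share one computation.
  intermediate.persist(StorageLevel.MEMORY_AND_DISK)
  println(s"size = ${intermediate.count()}")

  // ... subsequent processing on `intermediate` ...

  intermediate.unpersist(blocking = false)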

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
Yes, your broadcast should be about 300M, much smaller than 2G; I didn't read your post carefully. The broadcast in Python has been much improved since 1.1. I think it will work in 1.1 or the upcoming 1.2 release; could you upgrade to 1.1? Davies On Tue, Nov 11, 2014 at 8:37 PM, bliuab

How did the RDD.union work

2014-11-11 Thread qiaou
Hi: I have a problem using the union method of RDD. It goes like this: I have a function like def hbaseQuery(area: String): RDD[Result] = ??? When I use hbaseQuery("aa").union(hbaseQuery("bb")).count(), it returns 0. However, when used like this

Re: MLLIB usage: BLAS dependency warning

2014-11-11 Thread Xiangrui Meng
Could you try jar tf on the assembly jar and grep for netlib-native_system-linux-x86_64.so? -Xiangrui On Tue, Nov 11, 2014 at 7:11 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi, I am having trouble using the BLAS libs with the MLlib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a
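A concrete form of that check (the assembly jar path is an assumption; adjust it to your build):

  jar tf assembly/target/scala-2.10/spark-assembly-*.jar \
    | grep netlib-native_system-linux-x86_64.so

If the grep prints nothing, the native netlib library was not bundled into the assembly (e.g. the build was run without the -Pnetlib-lgpl profile).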

Re: spark-shell exception while running in YARN mode

2014-11-11 Thread hmxxyy
The Pi example gives the same error in yarn mode: HADOOP_CONF_DIR=/home/gs/conf/current ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client ../examples/target/spark-examples_2.10-1.2.0-SNAPSHOT.jar What could be wrong here?

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
Could you provide the code of hbaseQuery? Maybe it does not support being executed in parallel. Best Regards, Shixiong Zhu 2014-11-12 14:32 GMT+08:00 qiaou qiaou8...@gmail.com: Hi: I have a problem using the union method of RDD. It goes like this: I have a function like def

Re: Imbalanced shuffle read

2014-11-11 Thread Akhil Das
When you call groupByKey(), try providing the number of partitions, like groupByKey(100), depending on your data/cluster size. Thanks Best Regards On Wed, Nov 12, 2014 at 6:45 AM, ankits ankitso...@gmail.com wrote: I'm running a job that uses groupByKey(), so it generates a lot of shuffle
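A minimal sketch of that suggestion (the pair RDD and the partition count are arbitrary assumptions):

  // Hypothetical (key, value) RDD; 100 is an arbitrary partition count,
  // to be tuned to the data and cluster size as suggested above.
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val grouped = pairs.groupByKey(100)
  println(grouped.partitions.size) // 100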

Re: How did the RDD.union work

2014-11-11 Thread qiaou
OK, here is the code: def hbaseQuery: (String) => RDD[Result] = { val generateRdd = (area: String) => { val startRowKey = s"$area${RowKeyUtils.convertToHex(startId, 10)}" val stopRowKey = s"$area${RowKeyUtils.convertToHex(endId, 10)}"

spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-11-11 Thread tridib
Hi Friends, I am trying to save a JSON file to Parquet. I got the error Unsupported datatype TimestampType. Does Parquet not support dates? Which Parquet version does Spark use? Is there any workaround? Here is the stacktrace: java.lang.RuntimeException: Unsupported datatype TimestampType at
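One commonly suggested workaround, as a sketch only (the input path, table, and column names are assumptions, not from the thread): cast the timestamp column to a string (or a long) before writing, so the schema contains no TimestampType.

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val events = sqlContext.jsonFile("hdfs:///data/events.json") // hypothetical input
  events.registerTempTable("events")

  // Cast the timestamp column away before saving, since this Parquet
  // writer cannot handle TimestampType directly.
  val converted = sqlContext.sql("SELECT id, CAST(ts AS STRING) AS ts FROM events")
  converted.saveAsParquetFile("hdfs:///data/events.parquet")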

Re: groupBy for DStream

2014-11-11 Thread Akhil Das
1. Use foreachRDD over the DStream, and on each RDD you can call groupBy(). 2. DStream.count(): Return a new DStream in which each RDD has a single element generated by counting each RDD of this DStream. Thanks Best Regards On Wed, Nov 12, 2014 at 2:49 AM, SK skrishna...@gmail.com wrote:
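A minimal streaming sketch of both points (the socket source, batch interval, and key function are assumptions):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))
  val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

  // 1. groupBy per batch, via foreachRDD.
  lines.foreachRDD { rdd =>
    val grouped = rdd.groupBy(_.length) // arbitrary key function
    grouped.take(5).foreach(println)
  }

  // 2. Each batch's element count, as a one-element DStream.
  lines.count().print()

  ssc.start()
  ssc.awaitTermination()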

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu: Thank you for your reply. I will set up an experimental environment for Spark 1.1 and test it. On Wed, Nov 12, 2014 at 2:30 PM, Davies Liu-2 [via Apache Spark User List] ml-node+s1001560n1868...@n3.nabble.com wrote: Yes, your broadcast should be about 300M, much smaller than 2G, I

Re: How did the RDD.union work

2014-11-11 Thread qiaou
This works! But can you explain why it should be used like this? -- qiaou Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, November 12, 2014, at 3:18 PM, Shixiong Zhu wrote: You need to create a new configuration for each RDD. Therefore, val hbaseConf = HBaseConfigUtil.getHBaseConfiguration should be
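A sketch of the fix being described (HBaseConfigUtil comes from the quoted code; the table name, row-key handling, and TableInputFormat wiring are assumptions, since the full message is truncated): move the configuration inside the function, so each RDD gets its own conf instead of sharing one mutable object.

  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat
  import org.apache.spark.rdd.RDD

  def hbaseQuery(area: String): RDD[Result] = {
    // A fresh configuration per RDD; a shared conf would be mutated by the
    // second call before the first RDD's scan range is ever used.
    val hbaseConf = HBaseConfigUtil.getHBaseConfiguration // helper from the quoted code
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "mytable") // hypothetical table
    hbaseConf.set(TableInputFormat.SCAN_ROW_START, area) // hypothetical range
    hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, area + "~")
    sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]).map(_._2)
  }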

Re: ISpark class not found

2014-11-11 Thread MEETHU MATHEW
Hi, I was also trying ISpark, but I couldn't even start the notebook. I am getting the following error: ERROR:tornado.access:500 POST /api/sessions (127.0.0.1) 10.15ms referer=http://localhost:/notebooks/Scala/Untitled0.ipynb How did you start the notebook? Thanks Regards, Meethu M

About Join operator in PySpark

2014-11-11 Thread 夏俊鸾
Hi all, I have noticed that the join operator has been translated into union and groupByKey operators instead of the cogroup operator in PySpark; this change will probably generate more shuffle stages. For example: rdd1 = sc.makeRDD(...).partitionBy(2) rdd2 = sc.makeRDD(...).partitionBy(2)
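For contrast, a Scala sketch of the cogroup-based formulation of join (an illustration of why cogroup preserves co-partitioning, not the PySpark source):

  import org.apache.spark.HashPartitioner

  val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(2))
  val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(2))

  // join expressed via cogroup: since both inputs already share the same
  // partitioner, this adds no extra shuffle stage.
  val joined = rdd1.cogroup(rdd2).flatMapValues {
    case (vs, ws) => for (v <- vs; w <- ws) yield (v, w)
  }
  joined.collect().foreach(println)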

Re: Spark and Play

2014-11-11 Thread John Meehan
You can also build a Play 2.2.x + Spark 1.1.0 fat jar with sbt-assembly for, e.g., yarn-client support or for use with spark-shell for debugging: play.Project.playScalaSettings libraryDependencies ~= { _ map { case m if m.organization == "com.typesafe.play" => m.exclude("commons-logging",

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
The `conf` object will be sent to other nodes via Broadcast. Here is the scaladoc of Broadcast: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.broadcast.Broadcast In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes
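A minimal sketch of those semantics (values arbitrary):

  val v = Array(1, 2, 3)
  val bv = sc.broadcast(v)

  // Read the broadcast value on the executors. Do not mutate `v` after
  // this point, or different nodes may observe different values.
  val sums = sc.parallelize(1 to 4).map(x => x + bv.value.sum)
  println(sums.collect().mkString(","))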

Re: Read a HDFS file from Spark source code

2014-11-11 Thread rapelly kartheek
Hi Sean, I was following this link: http://mund-consulting.com/Blog/Posts/file-operations-in-HDFS-using-java.aspx But I was facing a FileSystem ambiguity error. I really don't have any idea how to go about doing this. Can you please help me get started? On Wed, Nov 12, 2014
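A sketch of the usual approach from a Spark driver (the path is an assumption); one common cause of the ambiguity is pulling in both java.nio.file.FileSystem and org.apache.hadoop.fs.FileSystem, so import the Hadoop one explicitly:

  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  // Build a Hadoop FileSystem handle from Spark's Hadoop configuration.
  val fs = FileSystem.get(sc.hadoopConfiguration)

  // Open an HDFS file (hypothetical path) and read it line by line.
  val in = fs.open(new Path("hdfs:///user/kartheek/sample.txt"))
  try {
    Source.fromInputStream(in).getLines().foreach(println)
  } finally {
    in.close()
  }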
