Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later). Best Regards, Shixiong Zhu 2014-11-12 15:20 GMT+08:00 qiaou qiaou8...@gmail.com: this works! But can you explain why it should be used like this? -- qiaou Sent with Sparrow http://www.sparrowmailapp.com
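
A minimal, hedged sketch (not the code from this thread, and assuming a spark-shell session where sc is defined) of the broadcast pattern being discussed: the value is shipped to executors and read through .value inside the closure, so every task sees the same copy, even on a node that joins later.

val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))          // shipped to executors as needed
val rdd = sc.parallelize(Seq(1, 2, 1, 2))
val named = rdd.map(k => lookup.value.getOrElse(k, "unknown"))  // always read via .value inside the task
println(named.collect().mkString(", "))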

Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
it? Is there a SparkContext field in the outer class? Best Regards, Shixiong Zhu 2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch: I am also using Spark 1.1.0 and I ran it on a cluster of nodes (it works if I run it in local mode!) If I put the accumulator inside the for loop, everything
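
A hedged sketch of the accumulator pattern under discussion (illustrative names, assuming sc is the SparkContext): the accumulator is created once on the driver, updated inside tasks, and read back on the driver. Capturing a SparkContext, or an outer object holding one, in the closure is what typically makes such a job fail on a cluster while still working in local mode.

val errorCount = sc.accumulator(0)            // created on the driver, outside any loop
sc.parallelize(1 to 100).foreach { i =>
  if (i % 10 == 0) errorCount += 1            // updated inside tasks; do not reference sc in here
}
println(s"errors: ${errorCount.value}")       // only the driver should read .value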

Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
Such a query is not supported right now. I can easily reproduce it. Created a JIRA here: https://issues.apache.org/jira/browse/SPARK-4296 Best Regards, Shixiong Zhu 2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com: I am trying to group by on a calculated field. Is it supported
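
A hedged illustration of the query shape involved and one commonly suggested workaround; every name here is made up, and it assumes a HiveContext (hiveCtx) so that substr() is available as a built-in.

case class Person(name: String)
import hiveCtx.createSchemaRDD                 // implicit RDD-to-SchemaRDD conversion in Spark 1.1
sc.parallelize(Seq(Person("Ann"), Person("Bob"), Person("Alice"))).registerTempTable("people")

// Grouping directly on the calculated field is the shape tracked by SPARK-4296:
//   SELECT substr(name, 1, 1), count(*) FROM people GROUP BY substr(name, 1, 1)
// A commonly suggested (but here untested) workaround is to materialize the derived column first:
hiveCtx.sql("SELECT initial, count(*) FROM (SELECT substr(name, 1, 1) AS initial FROM people) t GROUP BY initial")
  .collect().foreach(println)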

Re: How to trace/debug serialization?

2014-11-06 Thread Shixiong Zhu
Will this work even with Kryo serialization? Currently spark.closure.serializer must be org.apache.spark.serializer.JavaSerializer, so serializing closure functions does not involve Kryo. Kryo is only used to serialize the data. Best Regards, Shixiong Zhu 2014-11-07 12:27 GMT+08
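
A hedged configuration sketch of that split: Kryo is enabled for data serialization, while the closure serializer stays at its Java default.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-data-only")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // shuffle/cache/broadcast data
// spark.closure.serializer is left at org.apache.spark.serializer.JavaSerializer (the only
// supported value here), so closures are still serialized with Java serialization.
val sc = new SparkContext(conf)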

Re: Task size variation while using Range Vs List

2014-11-05 Thread Shixiong Zhu
is not persisted, Spark needs to load the data again. You can call RDD.cache to persist the RDD in memory. Best Regards, Shixiong Zhu 2014-11-06 11:35 GMT+08:00 nsareen nsar...@gmail.com: I noticed a behaviour where, if I'm using val temp = sc.parallelize(1 to 10
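
A hedged sketch of the caching advice (assuming a spark-shell session): without cache(), the lineage is recomputed, or the data reloaded, on every action.

val temp = sc.parallelize(1 to 10).map(i => i * i)   // the map is a stand-in for an expensive transformation
temp.cache()          // or temp.persist(StorageLevel.MEMORY_ONLY)
temp.count()          // the first action computes the partitions and caches them
temp.collect()        // later actions reuse the cached partitions instead of recomputing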

Re: Any limitations of spark.shuffle.spill?

2014-11-05 Thread Shixiong Zhu
Two limitations we found here: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-in-quot-cogroup-quot-td17349.html Best Regards, Shixiong Zhu 2014-11-06 2:04 GMT+08:00 Yangcheng Huang yangcheng.hu...@huawei.com: Hi One question about the power of spark.shuffle.spill – (I

Re: How to trace/debug serialization?

2014-11-05 Thread Shixiong Zhu
Best Regards, Shixiong Zhu 2014-11-06 7:56 GMT+08:00 ankits ankitso...@gmail.com: In my Spark job, I have a loop something like this: bla.forEachRdd(rdd => { //init some vars rdd.forEachPartition(partition => { //init some vars partition.foreach(kv => { ... I am seeing
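
A hedged Spark Streaming sketch of that loop shape, assuming stream is a DStream[(String, Int)]; the names and the threshold are illustrative. The comments note where each value lives and what gets serialized.

stream.foreachRDD { rdd =>
  val threshold = 10                          // driver-side value; captured by the closures below,
                                              // so it is serialized and shipped with the tasks
  rdd.foreachPartition { partition =>
    val buffer = scala.collection.mutable.ArrayBuffer[String]()  // created on the executor, once per task
    partition.foreach { case (key, count) =>
      if (count > threshold) buffer += key    // uses the shipped copy of threshold
    }
    // use buffer here (e.g. write to an external store), still on the executor
  }
}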

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread Shixiong Zhu
I mean updating the Spark conf not only in the driver, but also in the Spark workers. Because the driver's configuration cannot be read by the executors, they still use the default spark.io.compression.codec to deserialize the tasks. Best Regards, Shixiong Zhu 2014-10-28 16:39 GMT+08:00 buring
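
A hedged sketch of keeping the codec consistent across the cluster (the codec choice is illustrative): set it in the application conf and make sure every worker node carries the same value, for example in conf/spark-defaults.conf, since executors fall back to their own default when deserializing tasks.

import org.apache.spark.SparkConf

// In the application:
val conf = new SparkConf()
  .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
// And in conf/spark-defaults.conf on every node, so the workers/executors agree:
//   spark.io.compression.codec   org.apache.spark.io.LZFCompressionCodec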

Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Shixiong Zhu
Or def getAs[T](i: Int): T Best Regards, Shixiong Zhu 2014-10-29 13:16 GMT+08:00 Zhan Zhang zzh...@hortonworks.com: Can you use row(i).asInstanceOf[] Thanks. Zhan Zhang On Oct 28, 2014, at 5:03 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – The Spark SQL Row class has
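
A hedged sketch of the two retrieval styles mentioned, assuming the timestamp sits at column index 3 of a Row called row:

import java.sql.Timestamp

val ts1 = row(3).asInstanceOf[Timestamp]   // cast the Any returned by row(i)
val ts2 = row.getAs[Timestamp](3)          // the typed accessor mentioned above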

OutOfMemory in cogroup

2014-10-27 Thread Shixiong Zhu
and these values cannot fit into memory. Spilling data to disk does not help because cogroup needs to read all values for a key into memory. Any suggestions to solve these OOM cases? Thank you. Best Regards, Shixiong Zhu
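
For reference, a hedged sketch of the shape of the problem (paths and the key extraction are made up): cogroup materializes all values of a key as in-memory collections, so a single hot key can exhaust an executor's heap no matter how much is spilled elsewhere.

def extractKey(line: String): String = line.split(" ").head     // illustrative key extraction
val left  = sc.textFile("hdfs:///logs/a").map(l => (extractKey(l), l))
val right = sc.textFile("hdfs:///logs/b").map(l => (extractKey(l), l))
val grouped = left.cogroup(right)   // RDD[(String, (Iterable[String], Iterable[String]))]
// both Iterables for a given key are held in memory at the same time, which is where the OOM comes from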

Re: OutOfMemory in cogroup

2014-10-27 Thread Shixiong Zhu
to check if anyone has a similar problem and a better solution. Best Regards, Shixiong Zhu 2014-10-28 0:13 GMT+08:00 Holden Karau hol...@pigscanfly.ca: On Monday, October 27, 2014, Shixiong Zhu zsxw...@gmail.com wrote: We encountered some special OOM cases of cogroup when the data in one

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread Shixiong Zhu
Are you using Spark standalone mode? If so, you need to set spark.io.compression.codec for all workers. Best Regards, Shixiong Zhu 2014-10-28 10:37 GMT+08:00 buring qyqb...@gmail.com: Here is the error log, abstracted as follows: INFO [binaryTest---main]: before first WARN

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
Best Regards, Shixiong Zhu 2014-08-14 22:11 GMT+08:00 Christopher Nguyen c...@adatao.com: Hi Hoai-Thu, the issue of a private default constructor is unlikely to be the cause here, since Lance was already able to load/deserialize the model object. And on that side topic, I wish all serdes libraries

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think in the following case: class Foo { def foo() = Array(1.0) } val t = new Foo val m = t.foo val r1 = sc.parallelize(List(1, 2, 3)) val r2 = r1.map(_ + m(0)) r2.toArray Spark should not serialize t, but it looks like it will. Best Regards, Shixiong Zhu 2014-08-14 23:22 GMT+08:00 lancezhange
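
The quoted snippet, laid out as a hedged, shell-style sketch of the capture problem: the closure only needs m, but when these lines are entered in the Spark shell, m is a field of the enclosing line object, so serializing the closure can drag t (and Foo) along with it.

class Foo { def foo() = Array(1.0) }
val t = new Foo
val m = t.foo                        // m is just an Array[Double]
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))            // the closure only references m ...
r2.toArray                           // ... yet in the REPL it can pull in t as well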

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Shixiong Zhu
You can use JavaPairRDD.saveAsHadoopFile/saveAsNewAPIHadoopFile. Best Regards, Shixiong Zhu 2014-06-20 14:22 GMT+08:00 abhiguruvayya sharath.abhis...@gmail.com: Any inputs on this will be helpful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How
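
A hedged Scala-side sketch of the same idea (JavaPairRDD.saveAsHadoopFile / saveAsNewAPIHadoopFile are the Java API counterparts); the output path is made up.

import org.apache.spark.SparkContext._              // Writable conversions for saveAsSequenceFile in Spark 1.x
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
pairs.saveAsSequenceFile("hdfs:///tmp/pairs-seq")   // keys and values are wrapped as Writables for you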

Re: How to index each map operation????

2014-04-02 Thread Shixiong Zhu
solution is using rdd.partitionBy(new HashPartitioner(1)) to make sure there is only one partition. But that's not efficient for big input. Best Regards, Shixiong Zhu 2014-04-02 11:10 GMT+08:00 Thierry Herrmann thierry.herrm...@gmail.com: I'm new to Spark, but isn't this a pure Scala question
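
A hedged sketch of that single-partition approach (only reasonable for small inputs), assuming a spark-shell session:

import org.apache.spark.HashPartitioner
val indexed = sc.parallelize(Seq("a", "b", "c"))
  .map(x => (x, ()))                         // partitionBy needs a key/value RDD
  .partitionBy(new HashPartitioner(1))       // force everything into one partition
  .mapPartitions(_.zipWithIndex)             // one global, in-order index within that partition
indexed.collect().foreach { case ((x, _), i) => println(s"$i: $x") }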

Re: Joining two HDFS files in in Spark

2014-03-19 Thread Shixiong Zhu
to create an RDD from a collection. Best Regards, Shixiong Zhu 2014-03-19 20:52 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com: Not sure what you mean by not getting information on how to join. If you mean that you can't see the result, I believe you need to collect the result of the join
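
A hedged sketch of a plain key-based join of two HDFS files (paths, delimiters, and field positions are made up), assuming a spark-shell session:

val users  = sc.textFile("hdfs:///data/users.csv").map(_.split(",")).map(f => (f(0), f(1)))    // (userId, name)
val orders = sc.textFile("hdfs:///data/orders.csv").map(_.split(",")).map(f => (f(0), f(2)))   // (userId, amount)
val joined = users.join(orders)     // RDD[(userId, (name, amount))]
joined.collect().foreach(println)   // collect() is what makes the joined result visible on the driver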

Re: sequenceFile and groupByKey

2014-03-09 Thread Shixiong Zhu
().take(5) Best Regards, Shixiong Zhu 2014-03-09 13:30 GMT+08:00 Kane kane.ist...@gmail.com: when I try to open a sequence file: val t2 = sc.sequenceFile(/user/hdfs/e1Mseq, classOf[String], classOf[String]) t2.groupByKey().take(5) I get: org.apache.spark.SparkException: Job aborted: Task 25.0:0
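
A hedged sketch of a common pattern from that era, not necessarily the exact fix from this thread: read the sequence file with the Writable classes and copy keys/values to Strings before any shuffle such as groupByKey, since Hadoop reuses the Writable objects and Text is not Java-serializable.

import org.apache.hadoop.io.Text
val t2 = sc.sequenceFile("/user/hdfs/e1Mseq", classOf[Text], classOf[Text])
val strings = t2.map { case (k, v) => (k.toString, v.toString) }   // copy out of the reused Writables
strings.groupByKey().take(5).foreach(println)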

Re: o.a.s.u.Vector instances for equality

2014-03-03 Thread Shixiong Zhu
Regards, Shixiong Zhu 2014-03-04 4:23 GMT+08:00 Oleksandr Olgashko alexandrolg...@gmail.com: Hello. How should I best check two Vectors for equality? val a = new Vector(Array(1)) val b = new Vector(Array(1)) println(a == b) // false
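
A hedged sketch, assuming org.apache.spark.util.Vector, which wraps an Array[Double] in an elements field and does not override equals: compare the contents rather than the references.

import org.apache.spark.util.Vector
val a = new Vector(Array(1.0))
val b = new Vector(Array(1.0))
println(a == b)                                // false: reference comparison only
println(a.elements.sameElements(b.elements))   // true: element-wise comparison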
