Re: o.a.s.u.Vector instances for equality

2014-03-03 Thread Shixiong Zhu
Regards, Shixiong Zhu 2014-03-04 4:23 GMT+08:00 Oleksandr Olgashko alexandrolg...@gmail.com: Hello. How should I best check two Vectors for equality? val a = new Vector(Array(1)) val b = new Vector(Array(1)) println(a == b) // false
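A minimal sketch of one way to do the comparison, assuming the (since deprecated) org.apache.spark.util.Vector class and that its underlying Array[Double] is exposed as `elements` (on this class `==` falls back to reference equality, so compare the elements instead):

import org.apache.spark.util.Vector

val a = new Vector(Array(1.0))
val b = new Vector(Array(1.0))
val equal = a.elements.sameElements(b.elements)  // true: same length and same values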

Re: sequenceFile and groupByKey

2014-03-09 Thread Shixiong Zhu
().take(5) Best Regards, Shixiong Zhu 2014-03-09 13:30 GMT+08:00 Kane kane.ist...@gmail.com: when I try to open a sequence file: val t2 = sc.sequenceFile("/user/hdfs/e1Mseq", classOf[String], classOf[String]) t2.groupByKey().take(5) I get: org.apache.spark.SparkException: Job aborted: Task 25.0:0

Re: Joining two HDFS files in in Spark

2014-03-19 Thread Shixiong Zhu
to create an RDD from a collection. Best Regards, Shixiong Zhu 2014-03-19 20:52 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com: Not sure what you mean by not getting information how to join. If you mean that you can't see the result, I believe you need to collect the result of the join

Re: How to index each map operation????

2014-04-02 Thread Shixiong Zhu
solution is using rdd.partitionBy(new HashPartitioner(1)) to make sure there is only one partition. But that's not efficient for big input. Best Regards, Shixiong Zhu 2014-04-02 11:10 GMT+08:00 Thierry Herrmann thierry.herrm...@gmail.com: I'm new to Spark, but isn't this a pure Scala question

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Shixiong Zhu
You can use JavaPairRDD.saveAsHadoopFile/saveAsNewAPIHadoopFile. Best Regards, Shixiong Zhu 2014-06-20 14:22 GMT+08:00 abhiguruvayya sharath.abhis...@gmail.com: Any inputs on this will be helpful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
Best Regards, Shixiong Zhu 2014-08-14 22:11 GMT+08:00 Christopher Nguyen c...@adatao.com: Hi Hoai-Thu, the issue of private default constructor is unlikely the cause here, since Lance was already able to load/deserialize the model object. And on that side topic, I wish all serdes libraries

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think in the following case, class Foo { def foo() = Array(1.0) } val t = new Foo val m = t.foo val r1 = sc.parallelize(List(1, 2, 3)) val r2 = r1.map(_ + m(0)) r2.toArray Spark should not serialize t, but it looks like it will. Best Regards, Shixiong Zhu 2014-08-14 23:22 GMT+08:00 lancezhange

OutOfMemory in cogroup

2014-10-27 Thread Shixiong Zhu
and these values cannot fit into memory. Spilling data to disk does not help because cogroup needs to read all values for a key into memory. Any suggestion to solve these OOM cases? Thank you. Best Regards, Shixiong Zhu

Re: OutOfMemory in cogroup

2014-10-27 Thread Shixiong Zhu
to check if anyone has a similar problem and a better solution. Best Regards, Shixiong Zhu 2014-10-28 0:13 GMT+08:00 Holden Karau hol...@pigscanfly.ca: On Monday, October 27, 2014, Shixiong Zhu zsxw...@gmail.com wrote: We encountered some special OOM cases of cogroup when the data in one

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread Shixiong Zhu
Are you using spark standalone mode? If so, you need to set spark.io.compression.codec for all workers. Best Regards, Shixiong Zhu 2014-10-28 10:37 GMT+08:00 buring qyqb...@gmail.com: Here is error log,I abstract as follows: INFO [binaryTest---main]: before first WARN

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread Shixiong Zhu
I mean updating the spark conf not only in the driver, but also in the Spark Workers. Because the driver configurations cannot be read by the Executors, they still use the default spark.io.compression.codec to deserialize the tasks. Best Regards, Shixiong Zhu 2014-10-28 16:39 GMT+08:00 buring

Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Shixiong Zhu
Or def getAs[T](i: Int): T Best Regards, Shixiong Zhu 2014-10-29 13:16 GMT+08:00 Zhan Zhang zzh...@hortonworks.com: Can you use row(i).asInstanceOf[] Thanks. Zhan Zhang On Oct 28, 2014, at 5:03 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – The Spark SQL Row class has
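A quick hedged sketch of the getAs approach, with a hypothetical table and column; getAs simply casts the underlying value, so the column must really hold a java.sql.Timestamp:

val rows = sqlContext.sql("SELECT event_time FROM events").collect()  // hypothetical query
val ts: java.sql.Timestamp = rows.head.getAs[java.sql.Timestamp](0)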

Re: Task size variation while using Range Vs List

2014-11-05 Thread Shixiong Zhu
is not persisted, Spark needs to load the data again. You can call RDD.cache to persist the RDD in the memory. Best Regards, Shixiong Zhu 2014-11-06 11:35 GMT+08:00 nsareen nsar...@gmail.com: I noticed a behaviour where it was observed that, if i'm using val temp = sc.parallelize ( 1 to 10

Re: Any limitations of spark.shuffle.spill?

2014-11-05 Thread Shixiong Zhu
Two limitations we found here: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-in-quot-cogroup-quot-td17349.html Best Regards, Shixiong Zhu 2014-11-06 2:04 GMT+08:00 Yangcheng Huang yangcheng.hu...@huawei.com: Hi One question about the power of spark.shuffle.spill – (I

Re: How to trace/debug serialization?

2014-11-05 Thread Shixiong Zhu
. Best Regards, Shixiong Zhu 2014-11-06 7:56 GMT+08:00 ankits ankitso...@gmail.com: In my spark job, I have a loop something like this: bla.foreachRDD(rdd => { //init some vars rdd.foreachPartition(partition => { //init some vars partition.foreach(kv => { ... I am seeing

Re: How to trace/debug serialization?

2014-11-06 Thread Shixiong Zhu
Will this work even with Kryo serialization? Now spark.closure.serializer must be org.apache.spark.serializer.JavaSerializer. Therefore serializing closure functions won't involve Kryo; Kryo is only used to serialize the data. Best Regards, Shixiong Zhu 2014-11-07 12:27 GMT+08

Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
it? Is there a SparkContext field in the outer class? Best Regards, Shixiong Zhu 2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch: I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works if I run it in local mode! ) If I put the accumulator inside the for loop, everything

Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
Spark SQL doesn't support such a query now. I can easily reproduce it. Created a JIRA here: https://issues.apache.org/jira/browse/SPARK-4296 Best Regards, Shixiong Zhu 2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com: I am trying to group by on a calculated field. Is it supported

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
Could you provide the code of hbaseQuery? Maybe it doesn't support being executed in parallel. Best Regards, Shixiong Zhu 2014-11-12 14:32 GMT+08:00 qiaou qiaou8...@gmail.com: Hi: I got a problem with using the union method of RDD, something like this. I have a function like def hbaseQuery

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later). Best Regards, Shixiong Zhu 2014-11-12 15:20 GMT+08:00 qiaou qiaou8...@gmail.com: this works! but can you explain why it should be used like this? -- qiaou Sent with Sparrow http://www.sparrowmailapp.com

Re: Lots of small input files

2014-11-23 Thread Shixiong Zhu
to create some big enough tasks. Of course, you can reduce `spark.locality.wait`, but it may not be efficient because it still creates many tiny tasks. Best Regards, Shixiong Zhu 2014-11-22 17:17 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com: What is your cluster setup? Are you running a worker

Re: Negative Accumulators

2014-11-24 Thread Shixiong Zhu
: scala.math.BigInt = 100 ​ Best Regards, Shixiong Zhu 2014-11-25 10:31 GMT+08:00 Peter Thai thai.pe...@gmail.com: Hello! Does anyone know why I may be receiving negative final accumulator values? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
4096MB is greater than Int.MaxValue and it will overflow in Spark. Please set it to less than 4096. Best Regards, Shixiong Zhu 2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com: I met the same problem, did you solve it? -- View this message in context: http://apache-spark-user-list

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Sorry, it should be less than 2048; 2047 is the greatest value. Best Regards, Shixiong Zhu 2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: 4096MB is greater than Int.MaxValue and it will overflow in Spark. Please set it to less than 4096. Best Regards, Shixiong Zhu 2014-12
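A minimal sketch of setting the value within the allowed range before the SparkContext is created (1024 is just an illustrative value):

import org.apache.spark.{SparkConf, SparkContext}

// spark.akka.frameSize is in MB and 2047 is the largest accepted value.
val conf = new SparkConf()
  .setAppName("frame-size-example")
  .set("spark.akka.frameSize", "1024")
val sc = new SparkContext(conf)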

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Created a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-4664 Best Regards, Shixiong Zhu 2014-12-01 13:22 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: Sorry. Should be not greater than 2048. 2047 is the greatest value. Best Regards, Shixiong Zhu 2014-12-01 13:20 GMT+08:00

Re: Setting network variables in spark-shell

2014-12-01 Thread Shixiong Zhu
Don't set `spark.akka.frameSize` to 1. The max value of `spark.akka.frameSize` is 2047. The unit is MB. Best Regards, Shixiong Zhu 2014-12-01 0:51 GMT+08:00 Yanbo yanboha...@gmail.com: Try to use spark-shell --conf spark.akka.frameSize=1 On Dec 1, 2014, at 12:25 AM, Brian Dolan buddha_

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
What's the status of this application in the yarn web UI? Best Regards, Shixiong Zhu 2014-12-05 17:22 GMT+08:00 LinQili lin_q...@outlook.com: I tried another test code: def main(args: Array[String]) { if (args.length != 1) { Util.printLog(ERROR, Args error - arg1: BASE_DIR

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
not send it back to the client. spark-submit will return 1 when Yarn reports the ApplicationMaster failed. ​ Best Regards, Shixiong Zhu 2014-12-06 1:59 GMT+08:00 LinQili lin_q...@outlook.com: You mean the localhost:4040 or the application master web ui? Sent from my iPhone On Dec 5, 2014, at 17:26

Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
, Shixiong Zhu 2014-12-10 20:13 GMT+08:00 Johannes Simon johannes.si...@mail.de: Hi! I have been using spark a lot recently and it's been running really well and fast, but now when I increase the data size, it's starting to run into problems: I have an RDD in the form of (String, Iterable[String

Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
Good catch. `Join` should use `Iterator`, too. I opened a JIRA here: https://issues.apache.org/jira/browse/SPARK-4824 Best Regards, Shixiong Zhu 2014-12-10 21:35 GMT+08:00 Johannes Simon johannes.si...@mail.de: Hi! Using an iterator solved the problem! I've been chewing on this for days, so

Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Shixiong Zhu
Just to point out a bug in your code: you should not use `mapPartitions` like that. For details, I recommend the section on setup() and cleanup() in Sean Owen's post: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/ Best Regards, Shixiong Zhu 2014-12-14 16:35 GMT+08
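For reference, a hedged sketch of the per-partition setup/cleanup pattern described in that post; createConnection and writeRecord are hypothetical stand-ins for whatever HBase client calls the application uses:

rdd.foreachPartition { records =>
  val connection = createConnection()   // hypothetical: open one connection per partition, not per record
  try {
    records.foreach(record => writeRecord(connection, record))  // hypothetical write helper
  } finally {
    connection.close()                  // clean up before the task ends
  }
}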

Re: NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread Shixiong Zhu
Could you post the stack trace? Best Regards, Shixiong Zhu 2014-12-16 23:21 GMT+08:00 richiesgr richie...@gmail.com: Hi, this time I need an expert. On 1.1.1 and only in cluster (standalone or EC2) when I use this code : countersPublishers.foreachRDD(rdd => { rdd.foreachPartition

Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Shixiong Zhu
@Rui do you mean the spark-core jar in the Maven Central repo is incompatible with the same version of the official pre-built Spark binary? That's really weird. I thought they should have been built from the same code. Best Regards, Shixiong Zhu 2014-12-18 17:22 GMT+08:00 Sean Owen so...@cloudera.com

Re: Announcing Spark 1.2!

2014-12-19 Thread Shixiong Zhu
Congrats! A little question about this release: Which commit is this release based on? v1.2.0 and v1.2.0-rc2 point to different commits in https://github.com/apache/spark/releases Best Regards, Shixiong Zhu 2014-12-19 16:52 GMT+08:00 Patrick Wendell pwend...@gmail.com: I'm happy

Re: Dynamic Allocation in Spark 1.2.0

2014-12-27 Thread Shixiong Zhu
I encountered the following issue when enabling dynamicAllocation. You may want to take a look at it. https://issues.apache.org/jira/browse/SPARK-4951 Best Regards, Shixiong Zhu 2014-12-28 2:07 GMT+08:00 Tsuyoshi OZAWA ozawa.tsuyo...@gmail.com: Hi Anders, I faced the same issue as you

Re: recent join/iterator fix

2014-12-29 Thread Shixiong Zhu
The Iterable from cogroup is CompactBuffer, which is already materialized. It's not a lazy Iterable. So for now Spark cannot handle skewed data where some key has too many values to fit into memory.

Re: Problem with changing the akka.framesize parameter

2015-02-04 Thread Shixiong Zhu
The unit of spark.akka.frameSize is MB. The max value is 2047. Best Regards, Shixiong Zhu 2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com: I am trying to run a spark application with -Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000 -Dspark.akka.frameSize=1

Re: Problem with changing the akka.framesize parameter

2015-02-04 Thread Shixiong Zhu
Could you clarify why you need a 10G akka frame size? Best Regards, Shixiong Zhu 2015-02-05 9:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: The unit of spark.akka.frameSize is MB. The max value is 2047. Best Regards, Shixiong Zhu 2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com: I

Re: Issue with SparkContext in cluster

2015-01-28 Thread Shixiong Zhu
It's because you submitted the job from Windows to a Hadoop cluster running on Linux. Spark does not support that yet. See https://issues.apache.org/jira/browse/SPARK-1825 Best Regards, Shixiong Zhu 2015-01-28 17:35 GMT+08:00 Marco marco@gmail.com: I've created a spark app, which runs fine

Re: ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Shixiong Zhu
It's a bug that is fixed by https://github.com/apache/spark/pull/4258, but the fix has not been merged yet. Best Regards, Shixiong Zhu 2015-02-02 10:08 GMT+08:00 Arun Lists lists.a...@gmail.com: Here is the relevant snippet of code in my main program

Re: Joining by values

2015-01-03 Thread Shixiong Zhu
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu 2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code here I wrote https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main

Re: Trying to execute Spark in Yarn

2015-01-08 Thread Shixiong Zhu
`--jars` accepts a comma-separated list of jars. See the usage about `--jars` --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. Best Regards, Shixiong Zhu 2015-01-08 19:23 GMT+08:00 Guillermo Ortiz konstt2...@gmail.com: I'm trying to execute
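An illustrative invocation (class name, paths, and master are placeholders):

spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  my-app.jar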

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Shixiong Zhu
. For me, I will add -Dhbase.profile=hadoop2 to the build instruction so that the examples project will use a hadoop2-compatible HBase. Best Regards, Shixiong Zhu 2015-01-08 0:30 GMT+08:00 Antony Mayi antonym...@yahoo.com.invalid: thanks, I found the issue, I was including /usr/lib/spark/lib

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Shixiong Zhu
cases are the second one, we set spark.scheduler.executorTaskBlacklistTime to 3 to solve such "No space left on device" errors. So if a task fails in some executor, it won't be scheduled to the same executor within 30 seconds. Best Regards, Shixiong Zhu 2015-03-16 17:40 GMT+08:00 Jianshi

Re: Support for skewed joins in Spark

2015-03-12 Thread Shixiong Zhu
. Best Regards, Shixiong Zhu 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com: Does Spark support skewed joins similar to Pig which distributes large keys over multiple partitions? I tried using the RangePartitioner but I am still experiencing failures because some keys are too

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
There is no configuration for it now. Best Regards, Shixiong Zhu 2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: There may be firewall rules limiting the ports between host running spark and the hadoop cluster. In that case, not all ports are allowed. Can it be a range

Re: How to specify the port for AM Actor ...

2015-03-29 Thread Shixiong Zhu
LGTM. Could you open a JIRA and send a PR? Thanks. Best Regards, Shixiong Zhu 2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: I looked @ the 1.3.0 code and figured where this can be added In org.apache.spark.deploy.yarn ApplicationMaster.scala:282 is actorSystem

Re: Actor not found

2015-03-30 Thread Shixiong Zhu
Could you paste the whole stack trace here? Best Regards, Shixiong Zhu 2015-03-31 2:26 GMT+08:00 sparkdi shopaddr1...@dubna.us: I have the same problem, i.e. exception with the same call stack when I start either pyspark or spark-shell. I use spark-1.3.0-bin-hadoop2.4 on ubuntu 14.10. bin

Re: Actor not found

2015-03-31 Thread Shixiong Zhu
Thanks for the log. It's really helpful. I created a JIRA to explain why it will happen: https://issues.apache.org/jira/browse/SPARK-6640 However, does this error always happen in your environment? Best Regards, Shixiong Zhu 2015-03-31 22:36 GMT+08:00 sparkdi shopaddr1...@dubna.us

Re: java.util.NoSuchElementException: key not found:

2015-02-27 Thread Shixiong Zhu
RDD is not thread-safe. You should not use it in multiple threads. Best Regards, Shixiong Zhu 2015-02-27 23:14 GMT+08:00 rok rokros...@gmail.com: I'm seeing this java.util.NoSuchElementException: key not found: exception pop up sometimes when I run operations on an RDD from multiple threads

Re: Not able to update collections

2015-02-24 Thread Shixiong Zhu
RDD.foreach runs in the executors. You should use `collect` to fetch the data to the driver. E.g., myRdd.collect().foreach { node => mp(node) = 1 } Best Regards, Shixiong Zhu 2015-02-25 4:00 GMT+08:00 Vijayasarathy Kannan kvi...@vt.edu: Thanks, but it still doesn't seem

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
It's a random port to avoid port conflicts, since multiple AMs can run in the same machine. Why do you need a fixed port? Best Regards, Shixiong Zhu 2015-03-26 6:49 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: Spark 1.3, Hadoop 2.5, Kerbeors When running spark-shell in yarn client mode

Re: Can't get SparkListener to work

2015-04-21 Thread Shixiong Zhu
it from Eclipse on local[*]. On Sun, Apr 19, 2015 at 7:57 PM, Praveen Balaji secondorderpolynom...@gmail.com wrote: Thanks Shixiong. I'll try this. On Sun, Apr 19, 2015, 7:36 PM Shixiong Zhu zsxw...@gmail.com wrote: The problem is the code you use to test: sc.parallelize(List(1, 2, 3

Re: Can't get SparkListener to work

2015-04-19 Thread Shixiong Zhu
The problem is the code you use to test: sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect(); is like the following example: def foo: Int => Nothing = { throw new SparkException("test") } sc.parallelize(List(1, 2, 3)).map(foo).collect(); So actually the Spark jobs do not
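A small sketch of the difference, assuming an existing SparkContext `sc`: wrapping the throw in a function literal makes it run inside the tasks, so a job actually starts, fails, and the SparkListener callbacks fire.

import org.apache.spark.SparkException

// Evaluated on the driver while the closure is being built: no job ever starts.
// sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect()

// Evaluated on the executors: the job runs and fails, which the listener can observe.
sc.parallelize(List(1, 2, 3)).map { i =>
  throw new SparkException("test")
}.collect()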

Re: How to install spark in spark on yarn mode

2015-04-30 Thread Shixiong Zhu
://spark.apache.org/docs/latest/running-on-yarn.html Best Regards, Shixiong Zhu 2015-04-30 1:00 GMT-07:00 xiaohe lan zombiexco...@gmail.com: Hi Madhvi, If I only install spark on one node, and use spark-submit to run an application, which are the Worker nodes? And where are the executors? Thanks, Xiaohe

Re: Enabling Event Log

2015-04-30 Thread Shixiong Zhu
spark.history.fs.logDirectory is for the history server. Spark applications should use spark.eventLog.dir. Since you commented out spark.eventLog.dir, it defaults to /tmp/spark-events, and this folder does not exist. Best Regards, Shixiong Zhu 2015-04-29 23:22 GMT-07:00 James King jakwebin

Re: Timeout Error

2015-04-26 Thread Shixiong Zhu
The configuration key should be spark.akka.askTimeout for this timeout. The time unit is seconds. Best Regards, Shixiong(Ryan) Zhu 2015-04-26 15:15 GMT-07:00 Deepak Gopalakrishnan dgk...@gmail.com: Hello, Just to add a bit more context : I have done that in the code, but I cannot see it

Re: history server

2015-05-07 Thread Shixiong Zhu
The history server may need several hours to start if you have a lot of event logs. Is it stuck, or still replaying logs? Best Regards, Shixiong Zhu 2015-05-07 11:03 GMT-07:00 Marcelo Vanzin van...@cloudera.com: Can you get a jstack for the process? Maybe it's stuck somewhere. On Thu, May 7

Re: history server

2015-05-07 Thread Shixiong Zhu
SPARK-5522 is really cool. Didn't notice it. Best Regards, Shixiong Zhu 2015-05-07 11:36 GMT-07:00 Marcelo Vanzin van...@cloudera.com: That shouldn't be true in 1.3 (see SPARK-5522). On Thu, May 7, 2015 at 11:33 AM, Shixiong Zhu zsxw...@gmail.com wrote: The history server may need several

Re:

2015-05-06 Thread Shixiong Zhu
You are using Scala 2.11 with 2.10 libraries. You can change "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" to "org.apache.spark" %% "spark-streaming" % "1.3.1" And sbt will use the corresponding libraries according to your Scala version. Best Regards, Shixiong Zhu 2015-05-06 16:21 GMT-07:00
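A minimal build.sbt sketch of the cross-versioned form (version numbers are illustrative):

scalaVersion := "2.11.6"

// %% appends the Scala binary version, so this resolves to spark-streaming_2.11.
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.1"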

Re: Problem with current spark

2015-05-15 Thread Shixiong Zhu
Could you provide the full driver log? Looks like a bug. Thank you! Best Regards, Shixiong Zhu 2015-05-13 14:02 GMT-07:00 Giovanni Paolo Gibilisco gibb...@gmail.com: Hi, I'm trying to run an application that uses a Hive context to perform some queries over JSON files. The code

Re: Actor not found

2015-04-17 Thread Shixiong Zhu
I just checked the codes about creating OutputCommitCoordinator. Could you reproduce this issue? If so, could you provide details about how to reproduce it? Best Regards, Shixiong(Ryan) Zhu 2015-04-16 13:27 GMT+08:00 Canoe canoe...@gmail.com: 13119 Exception in thread main

Re: spark streaming printing no output

2015-04-14 Thread Shixiong Zhu
Could you see something like this in the console? --- Time: 142905487 ms --- Best Regards, Shixiong(Ryan) Zhu 2015-04-15 2:11 GMT+08:00 Shushant Arora shushantaror...@gmail.com: Hi I am running a spark

Re: spark streaming printing no output

2015-04-15 Thread Shixiong Zhu
: 142905487 ms strings gets printed on console. No output is getting printed. And timeinterval between two strings of form ( time:ms)is very less than Streaming Duration set in program. On Wed, Apr 15, 2015 at 5:11 AM, Shixiong Zhu zsxw...@gmail.com wrote: Could you see something like

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Shixiong Zhu
Cleaner java.lang.NoClassDefFoundError: 0 at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149) Best Regards, Shixiong Zhu 2015-06-03 0:08 GMT+08:00 Ryan Williams ryan.blake.willi...@gmail.com: I think

Re: learning rpc about spark core source code

2015-06-10 Thread Shixiong Zhu
the communication between driver and executors? Because this is ongoing work, there is no blog post yet. But you can find more details in this umbrella JIRA: https://issues.apache.org/jira/browse/SPARK-5293 Best Regards, Shixiong Zhu 2015-06-10 20:33 GMT+08:00 huangzheng 1106944...@qq.com: Hi all

Re: StreamingListener, anyone?

2015-06-04 Thread Shixiong Zhu
You should not call `jssc.stop(true);` in a StreamingListener. It will cause a dead-lock: `jssc.stop` won't return until `listenerBus` exits. But since `jssc.stop` blocks `StreamingListener`, `listenerBus` cannot exit. Best Regards, Shixiong Zhu 2015-06-04 0:39 GMT+08:00 dgoldenberg dgoldenberg
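One way to avoid the dead-lock (a hedged sketch, not spelled out in the thread; shouldStop is a hypothetical predicate) is to trigger the stop from a separate thread, so the listener callback returns and the listener bus can exit:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StopWhenDoneListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    if (shouldStop(batch)) {
      new Thread("stop-streaming-context") {
        override def run(): Unit = ssc.stop(stopSparkContext = true)  // runs outside the listener bus thread
      }.start()
    }
  }
  private def shouldStop(batch: StreamingListenerBatchCompleted): Boolean = false  // placeholder condition
}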

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Shixiong Zhu
How about other jobs? Is it an executor log, or a driver log? Could you post other logs near this error, please? Thank you. Best Regards, Shixiong Zhu 2015-06-02 17:11 GMT+08:00 Anders Arpteg arp...@spotify.com: Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that worked

Re: Application jar file not found exception when submitting application

2015-07-06 Thread Shixiong Zhu
Before running your script, could you confirm that /data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar exists? You might have forgotten to build this jar. Best Regards, Shixiong Zhu 2015-07-06 18:14 GMT+08:00 bit1...@163.com bit1...@163.com: Hi, I have following

Re: How to shut down spark web UI?

2015-07-06 Thread Shixiong Zhu
You can set spark.ui.enabled to false to disable the Web UI. Best Regards, Shixiong Zhu 2015-07-06 17:05 GMT+08:00 luohui20...@sina.com: Hello there, I heard that there is some way to shutdown Spark WEB UI, is there a configuration to support this? Thank you
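A minimal sketch of disabling it programmatically before the SparkContext is created (the same key can equally be passed via spark-submit --conf or spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-web-ui")
  .set("spark.ui.enabled", "false")   // the driver will not start the web UI on port 4040
val sc = new SparkContext(conf)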

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-21 Thread Shixiong Zhu
`ssc.stop` as the shutdown hook. But stopGracefully should be false. Best Regards, Shixiong Zhu 2015-05-20 21:59 GMT-07:00 Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com: Thanks Tathagata for making this change.. Dibyendu On Thu, May 21, 2015 at 8:24 AM, Tathagata Das t

Re: spark streaming map use external variable occur a problem

2015-08-14 Thread Shixiong Zhu
file. Could you convert your data to String using map and use saveAsTextFile or other save methods? Best Regards, Shixiong Zhu 2015-08-14 11:02 GMT+08:00 kale 805654...@qq.com: - To unsubscribe, e-mail: user-unsubscr

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Shixiong Zhu
Oh, I see. That's the total time of executing a query in Spark. Then the difference is reasonable, considering Spark has much more work to do, e.g., launching tasks in executors. Best Regards, Shixiong Zhu 2015-07-26 16:16 GMT+08:00 Louis Hust louis.h...@gmail.com: Look at the given url

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Shixiong Zhu
Could you clarify how you measure the Spark time cost? Is it the total time of running the query? If so, it's possible because the overhead of Spark dominates for small queries. Best Regards, Shixiong Zhu 2015-07-26 15:56 GMT+08:00 Jerrick Hoang jerrickho...@gmail.com: how big is the dataset

Re: Anybody hit this issue in spark shell?

2015-11-10 Thread Shixiong Zhu
to find similar issues in the PR build. Best Regards, Shixiong Zhu 2015-11-09 18:47 GMT-08:00 Ted Yu <yuzhih...@gmail.com>: > Created https://github.com/apache/spark/pull/9585 > > Cheers > > On Mon, Nov 9, 2015 at 6:39 PM, Josh Rosen <joshro...@databricks.com> > wrote

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them into a Seq and use "reduce(_ ++ _)". Best Regards, Shixiong Zhu 2015-11-11 10:21 GMT-08:00 Jakob Odersky <joder...@gmail.com>: > Hey Jeff, > Do you mean reading from multiple text files? In that c
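A short sketch of both options, assuming an existing SparkContext `sc` and placeholder paths (textFile itself also accepts a comma-separated list of paths):

// Comma-separated paths in a single call...
val combined1 = sc.textFile("/data/a.txt,/data/b.txt,/data/c.txt")

// ...or build the RDDs separately and union them.
val paths = Seq("/data/a.txt", "/data/b.txt", "/data/c.txt")
val combined2 = paths.map(sc.textFile(_)).reduce(_ ++ _)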

Re: Memory are not used according to setting

2015-11-04 Thread Shixiong Zhu
You should use `SparkConf.set` rather than `SparkConf.setExecutorEnv`. For driver configurations, you need to set them before starting your application; you can use the `--conf` argument when running `spark-submit`. Best Regards, Shixiong Zhu 2015-11-04 15:55 GMT-08:00 William Li
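A minimal sketch of the distinction (values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Executor memory is a Spark configuration, so use set(...), not setExecutorEnv(...).
val conf = new SparkConf()
  .setAppName("memory-settings")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)

Driver-side settings such as spark.driver.memory have to be in place before the driver JVM starts, e.g. spark-submit --conf spark.driver.memory=4g ..., rather than SparkConf.set inside the application.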

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Shixiong Zhu
"trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-15 Thread Shixiong Zhu
Thanks for reporting it Terry. I submitted a PR to fix it: https://github.com/apache/spark/pull/9132 Best Regards, Shixiong Zhu 2015-10-15 2:39 GMT+08:00 Reynold Xin <r...@databricks.com>: > +dev list > > On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo <hujie.ea...@gmail.co

Re: spark-shell :javap fails with complaint about JAVA_HOME, but it is set correctly

2015-10-15 Thread Shixiong Zhu
Scala 2.10 REPL javap doesn't support Java7 or Java8. It was fixed in Scala 2.11. See https://issues.scala-lang.org/browse/SI-4936 Best Regards, Shixiong Zhu 2015-10-15 4:19 GMT+08:00 Robert Dodier <robert.dod...@gmail.com>: > Hi, > > I am working with Spark 1.5.1 (o

Re: What is the abstraction for a Worker process in Spark code

2015-10-12 Thread Shixiong Zhu
Which mode are you using? For standalone, it's org.apache.spark.deploy.worker.Worker. For Yarn and Mesos, Spark just submits its request to them and they will schedule processes for Spark. Best Regards, Shixiong Zhu 2015-10-12 20:12 GMT+08:00 Muhammad Haseeb Javed <11besemja...@seecs.edu

Re: Spark UI consuming lots of memory

2015-10-12 Thread Shixiong Zhu
In addition, you cannot turn off JobListener and SQLListener now... Best Regards, Shixiong Zhu 2015-10-13 11:59 GMT+08:00 Shixiong Zhu <zsxw...@gmail.com>: > Is your query very complicated? Could you provide the output of `explain` > your query that consumes an excessive amou

Re: Spark UI consuming lots of memory

2015-10-12 Thread Shixiong Zhu
Could you show how you set the configurations? You need to set these configurations before creating SparkContext and SQLContext. Moreover, the history server doesn't support the SQL UI. So "spark.eventLog.enabled=true" doesn't work now. Best Regards, Shixiong Zhu 2015-10-13 2:01

Re: Unexplained sleep time

2015-10-12 Thread Shixiong Zhu
You don't need to care about this sleep. It runs in a separate thread and usually won't affect the performance of your application. Best Regards, Shixiong Zhu 2015-10-09 6:03 GMT+08:00 yael aharon <yael.aharo...@gmail.com>: > Hello, > I am working on improving the performance

Re: Spark UI consuming lots of memory

2015-10-12 Thread Shixiong Zhu
Is your query very complicated? Could you provide the output of `explain` your query that consumes an excessive amount of memory? If this is a small query, there may be a bug that leaks memory in SQLListener. Best Regards, Shixiong Zhu 2015-10-13 11:44 GMT+08:00 Nicholas Pritchard

Re: Creating Custom Receiver for Spark Streaming

2015-10-12 Thread Shixiong Zhu
Each ReceiverInputDStream will create one Receiver. If you only use one ReceiverInputDStream, there will be only one Receiver in the cluster. But if you create multiple ReceiverInputDStreams, there will be multiple Receivers. Best Regards, Shixiong Zhu 2015-10-12 23:47 GMT+08:00 Something
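A hedged sketch of fanning out multiple receivers and unioning them; MyReceiver stands in for a hypothetical custom Receiver[String] and ssc for an existing StreamingContext:

// Each receiverStream call creates one ReceiverInputDStream and therefore one Receiver.
val numReceivers = 3
val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyReceiver()))
val unified = ssc.union(streams)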

Re: Data skipped while writing Spark Streaming output to HDFS

2015-10-12 Thread Shixiong Zhu
Could you print the content of RDD to check if there are multiple values for a key in a batch? Best Regards, Shixiong Zhu 2015-10-12 18:25 GMT+08:00 Sathiskumar <sathish.palaniap...@gmail.com>: > I'm running a Spark Streaming application for every 10 seconds, its job is > to > co

Re: (de)serialize DStream

2015-07-08 Thread Shixiong Zhu
DStream must be Serializable because of metadata checkpointing, but you can use KryoSerializer for data checkpointing: data checkpointing uses RDD.checkpoint, whose serializer can be set via spark.serializer. Best Regards, Shixiong Zhu 2015-07-08 3:43 GMT+08:00 Chen Song chen.song...@gmail.com: In Spark
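A minimal sketch of pointing data checkpointing at Kryo (this only affects how checkpointed RDD data is serialized; the DStream graph in the metadata checkpoint still goes through Java serialization):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-kryo-checkpoint")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")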

Re: Some BlockManager Doubts

2015-07-09 Thread Shixiong Zhu
MemoryStore.ensureFreeSpace for details. Best Regards, Shixiong Zhu 2015-07-09 19:17 GMT+08:00 Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com: Hi , Just would like to clarify few doubts I have how BlockManager behaves . This is mostly in regards to Spark Streaming Context . There are two

Re: change default storage level

2015-07-09 Thread Shixiong Zhu
r1 = context.wholeTextFiles(...) val r2 = r1.flatMap(s => ...) r2.persist(StorageLevel.MEMORY) val r3 = r2.filter(...)... r3.saveAsTextFile(...) val r4 = r2.map(...)... r4.saveAsTextFile(...) See http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence Best Regards, Shixiong Zhu

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Hao, I can reproduce it using the master branch. I'm curious why you cannot reproduce it. Did you check if the input HadoopRDD did have two partitions? My test code is val df = sqlContext.read.json("examples/src/main/resources/people.json") df.show() Best Regards, Shixiong Zhu 2015-08-25 13:01

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
/org/apache/spark/sql/execution/SparkPlan.scala#L185 Best Regards, Shixiong Zhu 2015-08-25 8:11 GMT+08:00 Jeff Zhang zjf...@gmail.com: Hi Cheng, I know that sqlContext.read will trigger one spark job to infer the schema. What I mean is DataFrame#show cost 2 spark jobs. So overall it would cost

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Shixiong Zhu
That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this case. Best Regards, Shixiong Zhu 2015-08-25 14:01 GMT+08:00 Cheng, Hao hao.ch...@intel.com: O, Sorry, I miss reading your reply! I know the minimum tasks will be 2 for scanning, but Jeff is talking about 2 jobs

Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread Shixiong Zhu
The folder is in "/tmp" by default. Could you use "df -h" to check the free space of /tmp? Best Regards, Shixiong Zhu 2015-09-05 9:50 GMT+08:00 shenyan zhen <shenya...@gmail.com>: > Has anyone seen this error? Not sure which dir the program was trying to > write
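If /tmp is genuinely short on space, one option (an assumption about the setup, not something stated in the thread) is to point the scratch directory somewhere larger:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("bigger-scratch-dir")
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // hypothetical path with enough free space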

Re: ClassCastException in driver program

2015-09-06 Thread Shixiong Zhu
(i) i1.readObject() Could you provide the "explain" output? It would be helpful to find the circular references. Best Regards, Shixiong Zhu 2015-09-05 0:26 GMT+08:00 Jeff Jones <jjo...@adaptivebiotech.com>: > We are using Scala 2.11 for a driver program that is running

Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Shixiong Zhu
I mean JavaSparkContext.setLogLevel. You can use it like this: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2)); jssc.sparkContext().setLogLevel(...); Best Regards, Shixiong Zhu 2015-09-29 22:07 GMT+08:00 Ashish Soni <asoni.le...@gmail.com>: > I

Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Shixiong Zhu
You can use JavaSparkContext.setLogLevel to set the log level in your codes. Best Regards, Shixiong Zhu 2015-09-28 22:55 GMT+08:00 Ashish Soni <asoni.le...@gmail.com>: > I am not running it using spark submit , i am running locally inside > Eclipse IDE , how i set this usi

Re: What is the best way to submit multiple tasks?

2015-10-01 Thread Shixiong Zhu
Right, you can use SparkContext and SQLContext in multiple threads. They are thread safe. Best Regards, Shixiong Zhu 2015-10-01 4:57 GMT+08:00 <saif.a.ell...@wellsfargo.com>: > Hi all, > > I have a process where I do some calculations on each one of the columns > of a datafram
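A small sketch of submitting independent jobs concurrently against one SparkContext `sc` using Scala futures:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Three jobs submitted from separate threads; SparkContext schedules them concurrently.
val jobs = (1 to 3).map { i =>
  Future { sc.parallelize(1 to 1000000).map(_ * i).count() }
}
val results = jobs.map(Await.result(_, 10.minutes))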

Re: Spark Streaming Standalone 1.5 - Stage cancelled because SparkContext was shut down

2015-10-01 Thread Shixiong Zhu
Do you have the log? Looks like some exception in your code made the SparkContext stop. Best Regards, Shixiong Zhu 2015-09-30 17:30 GMT+08:00 tranan <tra...@gmail.com>: > Hello All, > > I have several Spark Streaming applications running on Standalone mode in > Spark 1.5.

Re: Worker node timeout exception

2015-10-01 Thread Shixiong Zhu
Do you have the log file? It may be because of wrong settings. Best Regards, Shixiong Zhu 2015-10-01 7:32 GMT+08:00 markluk <m...@juicero.com>: > I setup a new Spark cluster. My worker node is dying with the following > exception. > > Caused by: java.util.concurrent.Timeout
