com.google.protobuf out of memory

2014-05-25 Thread Zuhair Khayyat
Dear all, I am getting an OutOfMemoryError in class ByteString.java from package com.google.protobuf when processing very large data with Spark 0.9. Does increasing spark.shuffle.memoryFraction help, or should I add more memory to my workers? Below is the error I get during execution. 14/05/25

Re: Using Spark to analyze complex JSON

2014-05-25 Thread Michael Armbrust
On Sat, May 24, 2014 at 11:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Is the in-memory columnar store planned as part of SparkSQL? This has already been ported from Shark, and is used when you run cacheTable. Also, will both the HiveQL and SQL parsers be kept updated? Yeah, we need to
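For reference, a minimal sketch of the cacheTable call mentioned above, assuming a Spark 1.0 SQLContext and a hypothetical registered table named "people":

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)            // sc: an existing SparkContext
    // Caching a registered table switches it to the in-memory columnar store.
    sqlContext.cacheTable("people")
    sqlContext.sql("SELECT COUNT(*) FROM people")  // later scans read the columnar cache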

RE: GraphX partition problem

2014-05-25 Thread Zhicharevich, Alex
Thanks Ankur (http://www.ankurdave.com/). I built it from git and it works great. I have another issue now. I am trying to process a huge graph with about 20 billion edges with GraphX. I only load the file, compute connected components and persist it right back to disk. When working with subgraphs
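A minimal sketch of that pipeline, assuming an edge-list file on disk and the GraphX GraphLoader API; the HDFS paths are placeholders:

    import org.apache.spark.graphx._

    // Load the edge list, compute connected components, write straight back out.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///input/edges")
    val cc = graph.connectedComponents().vertices   // (vertexId, componentId) pairs
    cc.saveAsTextFile("hdfs:///output/components")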

PySpark Mesos random crashes

2014-05-25 Thread Perttu Ranta-aho
Hi, We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark 1.0.0-rc9 to overcome some PySpark bugs. But now we are experiencing random crashes with almost every job. Local jobs run fine, but the same code with the same data set on the Mesos cluster leads to errors like: 14/05/22 15:03:34

Re: PySpark Mesos random crashes

2014-05-25 Thread Mark Hamstra
The end of your example is the same as SPARK-1749. When a Mesos job causes an exception to be thrown in the DAGScheduler, the DAGScheduler needs to shut down the system. As part of that shutdown procedure, the DAGScheduler tries to kill any running jobs; but Mesos doesn't support

Re: GraphX partition problem

2014-05-25 Thread Ankur Dave
Once the graph is built, edges are stored in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). Unfortunately, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for
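A conceptual sketch of the layout described above (not the actual GraphX source): the 20 bytes per edge come from two 8-byte Longs plus one 4-byte Int, held in parallel primitive arrays rather than an Array of Edge objects:

    // Columnar edge storage: no per-edge object headers or pointers.
    class EdgePartitionSketch(
        val srcIds: Array[Long],  // 8 bytes per edge
        val dstIds: Array[Long],  // 8 bytes per edge
        val attrs:  Array[Int])   // 4 bytes per edge => 20 bytes per edge total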

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-25 Thread Andrew Ash
Hi Martin, Tim suggested that you pastebin the Mesos logs -- can you share those for the list? Cheers, Andrew On Thu, May 15, 2014 at 5:02 PM, Martin Weindel martin.wein...@gmail.com wrote: Andrew, thanks for your response. When using the coarse mode, the jobs run fine. My problem is the

Re: problem about broadcast variable in iteration

2014-05-25 Thread Andrew Ash
Hi Randy, In Spark 1.0 there was a lot of work done to allow unpersisting data that's no longer needed. See the below pull request. Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the next variable to see if you can cut the dependency there.
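A hedged sketch of that pattern with hypothetical names throughout; Broadcast.unpersist() is the Spark 1.0 API referred to above:

    // data: an RDD[Int]; table: a Map[Int, Double] rebuilt each iteration.
    var table = Map(1 -> 0.0)
    var bcast = sc.broadcast(table)
    for (i <- 1 to 10) {
      val updates = data.map(x => (x, bcast.value.getOrElse(x, 0.0) + 1.0)).collect()
      bcast.unpersist()             // Spark 1.0: release the old broadcast's blocks
      table = table ++ updates      // stand-in for the real per-iteration update
      bcast = sc.broadcast(table)   // re-broadcast the fresh copy
    }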

Re: KryoSerializer Exception

2014-05-25 Thread Andrew Ash
Hi Andrea, What version of Spark are you using? There were some improvements in how Spark uses Kryo in 0.9.1 and the upcoming 1.0 that I would expect to improve this. Also, can you share your registrator's code? Another possibility is that Kryo can have some difficulty serializing very large objects.
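For reference, the shape of a Kryo registrator in that era's API; MyClass is a stand-in for whatever type is actually being serialized:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyClass  // hypothetical stand-in

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyClass])  // register every custom class you ship
      }
    }
    // Enabled via configuration:
    //   spark.serializer        org.apache.spark.serializer.KryoSerializer
    //   spark.kryo.registrator  MyRegistrator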

Re: Comprehensive Port Configuration reference?

2014-05-25 Thread Andrew Ash
Hi Jacob, The config option spark.history.ui.port is new in 1.0. The problem the History Server solves is that in non-Standalone cluster deployment modes (Mesos and YARN) there is no long-lived Spark Master that can store logs and statistics about an application after it finishes. History
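A minimal sketch of the application-side configuration that feeds the 1.0 History Server (property names per the Spark 1.0 docs; the HDFS path is a placeholder):

    import org.apache.spark.SparkConf

    // Write event logs to a shared location the History Server can read
    // after the application (and its Mesos/YARN master) is gone.
    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events")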

Re: can communication and computation be overlapped in spark?

2014-05-25 Thread wxhsdp
Anyone see my thread?

Re: how to set task number?

2014-05-25 Thread qingyang li
Hi Mayur, thanks for replying. I know a Spark application should take all cores by default. My question is how to set the task number on each core? If one slice means one task, how can I set the slice file size? 2014-05-23 16:37 GMT+08:00 Mayur Rustagi mayur.rust...@gmail.com: How many cores do you see

counting degrees graphx

2014-05-25 Thread dizzy5112
Hi, looking for a little help on counting the degrees in a graph. Currently my graph consists of 2 subgraphs and it looks like this: val vertexArray = Array( (1L,(101,x)), (2L,(102,y)), (3L,(103,y)), (4L,(104,y)), (5L,(105,y)), (6L,(106,x)), (7L,(107,x)), (8L,(108,y)) ) val edgeArray =
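For the degree counts themselves, a minimal sketch assuming the graph above has been assembled into a GraphX Graph (e.g. via Graph(vertexRDD, edgeRDD)); the degree operators return one (vertexId, count) pair per vertex:

    // inDegrees / outDegrees omit vertices with zero such edges;
    // degrees counts edges in both directions.
    val inDeg  = graph.inDegrees    // VertexRDD[Int]
    val outDeg = graph.outDegrees
    val all    = graph.degrees
    inDeg.collect().foreach(println)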

Re: how to set task number?

2014-05-25 Thread Aaron Davidson
How many partitions are in your input data set? A possibility is that your input data has 10 unsplittable files, so you end up with 10 partitions. You could improve this by using RDD#repartition(). Note that mapPartitionsWithIndex is sort of the main processing loop for many Spark functions. It
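A minimal sketch of the repartition suggestion, with a placeholder path and an illustrative partition count:

    // 10 unsplittable input files => 10 partitions unless we reshuffle.
    val data = sc.textFile("hdfs:///input")    // e.g. 10 gzipped files
    val repartitioned = data.repartition(100)  // one shuffle buys more parallelism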

Re: how to set task number?

2014-05-25 Thread qingyang li
Hi Aaron, thanks for sharing. I am using Shark to execute the query, and the table is created on Tachyon. I think I cannot use RDD#repartition() in the Shark CLI; does Shark support SET mapred.max.split.size to control the file size? If yes, after I create the table I can control the file number, then I can

Re: com.google.protobuf out of memory

2014-05-25 Thread Hao Wang
Hi Zuhair, In my experience, you could try the following steps to avoid a Spark OOM: 1. Increase JVM memory by adding export SPARK_JAVA_OPTS=-Xmx2g 2. Use .persist(storage.StorageLevel.MEMORY_AND_DISK) instead of .cache() 3. Have you set the spark.executor.memory value? It's 512m by
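Sketching those suggestions with 0.9-era settings (the values are illustrative, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    // 1. In spark-env.sh:  export SPARK_JAVA_OPTS="-Xmx2g"
    // 3. Raise the per-executor heap (the default was 512m):
    val conf = new SparkConf().set("spark.executor.memory", "2g")
    // 2. Spill to disk rather than failing with OOM (rdd: some cached RDD):
    rdd.persist(StorageLevel.MEMORY_AND_DISK)   // instead of rdd.cache()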

Re: how to set task number?

2014-05-25 Thread Aaron Davidson
You can try setting mapred.map.tasks to get Hive to do the right thing. On Sun, May 25, 2014 at 7:27 PM, qingyang li liqingyang1...@gmail.com wrote: Hi, Aaron, thanks for sharing. I am using shark to execute query , and table is created on tachyon. I think i can not using RDD#repartition()

Fails: Spark sbt/sbt publish local

2014-05-25 Thread ABHISHEK
Hi, I'm trying to install Spark along with Shark. Here are the configuration details: Spark 0.9.1 Shark 0.9.1 Scala 2.10.3 The Spark assembly was successful, but running sbt/sbt publish-local failed. Please refer to the attached log for more details and advise. Thanks, Abhishek

Re: counting degrees graphx

2014-05-25 Thread Ankur Dave
I'm not sure I understand what you're looking for. Could you provide some more examples to clarify? One interpretation is that you want to tag the source vertices in a graph (those with zero indegree) and find for each vertex the set of sources that lead to that vertex. For vertices 1-8 in the
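Under that interpretation, a hedged Pregel sketch (1.0-era GraphX API; sourceSets is a hypothetical helper): seed each zero-indegree vertex with its own id and flood the sets forward until nothing new arrives:

    import scala.reflect.ClassTag
    import org.apache.spark.graphx._

    def sourceSets[VD: ClassTag, ED: ClassTag](
        graph: Graph[VD, ED]): Graph[Set[VertexId], ED] = {
      // Seed: vertices absent from inDegrees have no incoming edges, so they
      // are sources and start with themselves; everyone else starts empty.
      val seeded = graph.outerJoinVertices(graph.inDegrees) {
        (id, _, inDeg) => if (inDeg.isEmpty) Set(id) else Set.empty[VertexId]
      }
      Pregel(seeded, Set.empty[VertexId])(
        (id, attr, msg) => attr ++ msg,               // absorb newly arrived sources
        t => if ((t.srcAttr -- t.dstAttr).nonEmpty)   // forward only unseen sources
               Iterator((t.dstId, t.srcAttr))
             else Iterator.empty,
        (a, b) => a ++ b)                             // merge concurrent messages
    }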

Re: how to set task number?

2014-05-25 Thread qingyang li
I tried SET mapred.map.tasks=30; it does not work, it seems Shark does not support this setting. I also tried SET mapred.max.split.size=6400; it does not work, either. Is there another way to control the task number in the Shark CLI? 2014-05-26 10:38 GMT+08:00 Aaron Davidson ilike...@gmail.com: You

Re: counting degrees graphx

2014-05-25 Thread ankurdave
Sorry, I missed vertex 6 in that example. It should be [{1}, {1}, {1}, {1}, {1, 6}, {6}, {7}, {7}].

Re: how to set task number?

2014-05-25 Thread Aaron Davidson
What is the format of your input data, prior to insertion into Tachyon? On Sun, May 25, 2014 at 7:52 PM, qingyang li liqingyang1...@gmail.com wrote: i tried set mapred.map.tasks=30 , it does not work, it seems shark does not support this setting. i also tried SET

Re: Fails: Spark sbt/sbt publish local

2014-05-25 Thread Aaron Davidson
I suppose you actually ran publish-local and not publish local like your example showed. That being the case, could you show the compile error that occurs? It could be related to the Hadoop version. On Sun, May 25, 2014 at 7:51 PM, ABHISHEK abhi...@gmail.com wrote: Hi, I'm trying to install

Re: Fails: Spark sbt/sbt publish local

2014-05-25 Thread ABHISHEK
Thanks for the reply, Aaron. I tried sbt/sbt publish local but got the error below. [error] /home/cloudera/at_Installation/spark-0.9.1-bin-cdh4/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found :

Re: Fails: Spark sbt/sbt publish local

2014-05-25 Thread Aaron Davidson
Googling that error, I came across something that appears relevant: https://groups.google.com/forum/#!msg/spark-users/T1soH67C5M4/vihzNt92anYJ I'd try just doing sbt/sbt clean first, and if that fails, digging deeper into that thread. (By the way, sbt/sbt publish-local IS what you want,

Re: counting degrees graphx

2014-05-25 Thread dizzy5112
Yes, that's correct: I want the set of source vertices for each vertex in the graph. Which of course leads me on to my next question, which is to add a level to each of these. http://apache-spark-user-list.1001560.n3.nabble.com/file/n6383/image1.jpg For example, the image shows the in and out links of the

Re: how to set task number?

2014-05-25 Thread qingyang li
I used create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 to create table bigtable002; bigtable001 is loaded from HDFS and its format is text file, so I think bigtable002's is text too. 2014-05-26 11:14 GMT+08:00 Aaron Davidson ilike...@gmail.com: