Dear all,
I am getting an OutOfMemoryError in class ByteString.java from package
com.google.protobuf when processing very large data using Spark 0.9. Does
increasing spark.shuffle.memoryFraction help, or should I add more memory
to my workers? Below is the error I get during execution.
14/05/25
On Sat, May 24, 2014 at 11:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Is the in-memory columnar store planned as part of SparkSQL ?
This has already been ported from Shark, and is used when you run
cacheTable.
Also, will both the HiveQL and SQL parsers be kept updated?
Yeah, we need to
Thanks Ankur <http://www.ankurdave.com/>,
Built it from git and it works great.
I have another issue now. I am trying to process a huge graph with about 20
billion edges with GraphX. I only load the file, compute connected components,
and persist the result right back to disk. When working with subgraphs
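As a toy illustration of the connected-components idea being computed here (plain Python with union-find, not GraphX code; the small edge list is made up for the example):

```python
# Union-find sketch of connected components over an edge list.
def connected_components(num_vertices, edges):
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in edges:
        parent[find(src)] = find(dst)

    # Label every vertex with its component's root.
    return [find(v) for v in range(num_vertices)]

edges = [(0, 1), (1, 2), (3, 4)]
labels = connected_components(5, edges)
# Vertices 0-2 share one label, 3-4 another.
```

GraphX does this distributed over edge partitions, but the output contract is the same: each vertex is tagged with a component id.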
Hi,
We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark
1.0.0-rc9 to overcome some PySpark bugs. But now we are experiencing
random crashes with almost every job. Local jobs run fine, but the same code
with the same data set on the Mesos cluster leads to errors like:
14/05/22 15:03:34
The end of your example is the same as SPARK-1749. When a Mesos job causes
an exception to be thrown in the DAGScheduler, the DAGScheduler has to shut
down the system. As part of that shutdown procedure, the DAGScheduler tries
to kill any running jobs; but Mesos doesn't support
Once the graph is built, edges are stored in parallel primitive arrays, so
each edge should only take 20 bytes to store (srcId: Long, dstId: Long,
attr: Int). Unfortunately, the current implementation in
EdgePartitionBuilder uses an array of Edge objects as an intermediate
representation for
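The 20-bytes-per-edge figure above makes for easy back-of-the-envelope sizing. A sketch, using the 20-billion-edge graph mentioned earlier in the thread:

```python
# Per-edge cost in the primitive-array layout:
# srcId (Long, 8 bytes) + dstId (Long, 8 bytes) + attr (Int, 4 bytes).
bytes_per_edge = 8 + 8 + 4  # = 20 bytes

num_edges = 20_000_000_000  # the 20-billion-edge graph from the earlier message
total_gb = bytes_per_edge * num_edges / 1e9
print(total_gb)  # 400.0 -> roughly 400 GB across the cluster, before any overhead
```

The intermediate array of Edge objects in EdgePartitionBuilder costs considerably more than this, since each object carries JVM header and pointer overhead on top of the raw fields.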
Hi Martin,
Tim suggested that you pastebin the mesos logs -- can you share those for
the list?
Cheers,
Andrew
On Thu, May 15, 2014 at 5:02 PM, Martin Weindel martin.wein...@gmail.com wrote:
Andrew,
thanks for your response. When using the coarse mode, the jobs run fine.
My problem is the
Hi Randy,
In Spark 1.0 there was a lot of work done to allow unpersisting data that's
no longer needed; see the pull request below.
Try running kvGlobal.unpersist() on line 11, before the re-broadcast of the
next variable, to see if you can cut the dependency there.
Hi Andrea,
What version of Spark are you using? There were some improvements in how
Spark uses Kryo in 0.9.1 and the upcoming 1.0 that I would expect to improve this.
Also, can you share your registrator's code?
Another possibility is that Kryo can have some difficulty serializing very
large objects.
Hi Jacob,
The config option spark.history.ui.port is new for 1.0. The problem that the
History Server solves is that in non-standalone cluster deployment modes
(Mesos and YARN) there is no long-lived Spark Master that can store logs
and statistics about an application after it finishes. History
anyone see my thread?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/can-communication-and-computation-be-overlapped-in-spark-tp6348p6368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Mayur, thanks for replying.
I know a Spark application should take all cores by default. My question is
how to set the task number on each core.
If one slice means one task, how can I set the slice file size?
2014-05-23 16:37 GMT+08:00 Mayur Rustagi mayur.rust...@gmail.com:
How many cores do you see
Hi, looking for a little help on counting the degrees in a graph. Currently
my graph consists of 2 subgraphs and it looks like this:
val vertexArray = Array(
(1L,(101,x)),
(2L,(102,y)),
(3L,(103,y)),
(4L,(104,y)),
(5L,(105,y)),
(6L,(106,x)),
(7L,(107,x)),
(8L,(108,y))
)
val edgeArray =
How many partitions are in your input data set? A possibility is that your
input data has 10 unsplittable files, so you end up with 10 partitions. You
could improve this by using RDD#repartition().
Note that mapPartitionsWithIndex is sort of the main processing loop for
many Spark functions. It
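A toy illustration of the point about unsplittable files (plain Python, not Spark code; the file contents are made up): each unsplittable file yields exactly one partition, and a repartition-style shuffle is what spreads the records out.

```python
# 10 "files" of 100 records each -> 10 partitions, only 10-way parallelism.
files = [[f"rec{f}_{i}" for i in range(100)] for f in range(10)]
partitions = files  # one partition per unsplittable file

def repartition(parts, n):
    # Round-robin records into n new partitions (roughly what a shuffle does).
    out = [[] for _ in range(n)]
    for i, rec in enumerate(rec for p in parts for rec in p):
        out[i % n].append(rec)
    return out

bigger = repartition(partitions, 40)
print(len(partitions), len(bigger))  # 10 40
```

With RDD#repartition() the same record set ends up in more, smaller partitions, so more tasks can run in parallel.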
Hi, Aaron, thanks for sharing.
I am using Shark to execute queries, and the table is created on Tachyon. I
think I cannot use RDD#repartition() in the Shark CLI.
Does Shark support SET mapred.max.split.size to control the file size?
If yes, after I create the table I can control the file number, then I can
Hi, Zuhair
In my experience, you could try the following steps to avoid Spark
OOM:
1. Increase JVM memory by adding export SPARK_JAVA_OPTS=-Xmx2g
2. Use .persist(storage.StorageLevel.MEMORY_AND_DISK) instead of .cache()
3. Have you set spark.executor.memory value? It's 512m by
You can try setting mapred.map.tasks to get Hive to do the right thing.
On Sun, May 25, 2014 at 7:27 PM, qingyang li liqingyang1...@gmail.com wrote:
Hi, Aaron, thanks for sharing.
I am using Shark to execute queries, and the table is created on Tachyon. I
think I cannot use RDD#repartition()
Hi,
I'm trying to install Spark along with Shark.
Here's configuration details:
Spark 0.9.1
Shark 0.9.1
Scala 2.10.3
The Spark assembly was successful, but running sbt/sbt publish-local failed.
Please refer to the attached log for more details and advise.
Thanks,
Abhishek
I'm not sure I understand what you're looking for. Could you provide some
more examples to clarify?
One interpretation is that you want to tag the source vertices in a graph
(those with zero indegree) and find for each vertex the set of sources that
lead to that vertex. For vertices 1-8 in the
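Under that interpretation, a plain-Python sketch of the computation (the DAG here is hypothetical, not the poster's graph): find the zero-indegree sources, then propagate each source's id forward along out-edges.

```python
from collections import defaultdict

def source_sets(vertices, edges):
    # Sources are the vertices with zero indegree.
    targets = {dst for _, dst in edges}
    sources = [v for v in vertices if v not in targets]

    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    # For each vertex, the set of sources that can reach it.
    reach = {v: set() for v in vertices}
    for s in sources:
        stack = [s]
        while stack:
            v = stack.pop()
            if s in reach[v]:
                continue
            reach[v].add(s)
            stack.extend(out[v])
    return reach

# Hypothetical DAG: 1 -> 2 -> 3 and 4 -> 3.
r = source_sets([1, 2, 3, 4], [(1, 2), (2, 3), (4, 3)])
# r[3] == {1, 4}: both sources lead to vertex 3.
```

In GraphX the same idea maps naturally onto Pregel-style iteration, with each vertex accumulating the set of source ids it has received from its in-neighbors.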
I tried SET mapred.map.tasks=30; it does not work. It seems Shark does
not support this setting.
I also tried SET mapred.max.split.size=6400; it does not work either.
Is there another way to control the task number in the Shark CLI?
2014-05-26 10:38 GMT+08:00 Aaron Davidson ilike...@gmail.com:
You
Sorry, I missed vertex 6 in that example. It should be [{1}, {1}, {1}, {1},
{1, 6}, {6}, {7}, {7}].
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/counting-degrees-graphx-tp6370p6378.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
What is the format of your input data, prior to insertion into Tachyon?
On Sun, May 25, 2014 at 7:52 PM, qingyang li liqingyang1...@gmail.com wrote:
i tried set mapred.map.tasks=30 , it does not work, it seems shark does
not support this setting.
i also tried SET
I suppose you actually ran publish-local and not publish local like
your example showed. That being the case, could you show the compile error
that occurs? It could be related to the hadoop version.
On Sun, May 25, 2014 at 7:51 PM, ABHISHEK abhi...@gmail.com wrote:
Hi,
I'm trying to install
Thanks for the reply, Aaron.
I tried sbt/sbt publish local but got the below error.
[error]
/home/cloudera/at_Installation/spark-0.9.1-bin-cdh4/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669:
type mismatch;
[error] found :
Googling that error, I came across something that appears relevant:
https://groups.google.com/forum/#!msg/spark-users/T1soH67C5M4/vihzNt92anYJ
I'd try just doing sbt/sbt clean first, and if that fails, digging deeper
into that thread.
(By the way, sbt/sbt publish-local IS what you want,
Yes, that's correct: I want the vertex set for each source vertex in the graph.
Which of course leads me on to my next question, which is to add a level to
each of these.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n6383/image1.jpg
For example the image shows the in and out links of the
I used create table bigtable002 tblproperties('shark.cache'='tachyon')
as select * from bigtable001 to create table bigtable002. bigtable001
is loaded from HDFS and its format is text file, so I think
bigtable002's is text.
2014-05-26 11:14 GMT+08:00 Aaron Davidson ilike...@gmail.com: