Re: Java heap space and spark.akka.frameSize Inbox x

2014-04-21 Thread Chieh-Yen
Can anybody help me? Thanks. Chieh-Yen On Wed, Apr 16, 2014 at 5:18 PM, Chieh-Yen r01944...@csie.ntu.edu.tw wrote: Dear all, I developed an application whose communication messages are sometimes larger than 10 MB. For smaller datasets it works fine, but it fails for larger datasets.

Re: Java heap space and spark.akka.frameSize Inbox x

2014-04-21 Thread Akhil Das
Hi Chieh, You can increase the heap size by exporting the Java options (see below; this will increase the heap size to 10 GB): export _JAVA_OPTIONS=-Xmx10g On Mon, Apr 21, 2014 at 11:43 AM, Chieh-Yen r01944...@csie.ntu.edu.tw wrote: Can anybody help me? Thanks. Chieh-Yen On Wed, Apr 16, 2014
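Since the thread subject mentions spark.akka.frameSize, a minimal sketch of setting both the heap and the frame size through SparkConf instead of environment variables (the 10g heap and 128 MB frame size are illustrative values, not recommendations; the app name is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch with illustrative values: raise executor memory and the Akka frame
// size so messages larger than the 10 MB default limit can be sent.
val conf = new SparkConf()
  .setAppName("LargeMessageApp")          // hypothetical app name
  .set("spark.executor.memory", "10g")    // executor heap
  .set("spark.akka.frameSize", "128")     // in MB; the default was 10
val sc = new SparkContext(conf)
```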

Re: running tests selectively

2014-04-21 Thread Mark Hamstra
You should add the hub command line wrapper of git for github to that wiki page: https://github.com/github/hub -- doesn't look like I have edit access to the wiki, or I've forgotten a password, or something Once you've got hub installed and aliased, you've got some nice additional options,

Re: Task splitting among workers

2014-04-21 Thread Arpit Tak
1.) How about if the data is in S3 and we cache it in memory, instead of HDFS? 2.) How is the number of reducers determined in both cases? Even if I specify set mapred.reduce.tasks=50, somehow only 2 reducers are allocated instead of 50, although the query/tasks complete. Regards, Arpit
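In core Spark (as opposed to Hive-style session settings in Shark) the reduce-side parallelism is the numPartitions argument of the shuffle operation itself; a hedged sketch, with the `pairs` RDD name assumed:

```scala
// Sketch: in core Spark the reducer count is the numPartitions argument
// passed to the shuffle operation, not a session-level setting.
val counts  = pairs.reduceByKey(_ + _, 50)  // 50 reduce tasks
val grouped = pairs.groupByKey(50)          // likewise for groupByKey
```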

Re: Are there any plans to develop Graphx Streaming?

2014-04-21 Thread Ankur Dave
On Sun, Apr 20, 2014 at 6:27 PM, Qi Song songqi1...@gmail.com wrote: I wonder if there exists some documentation about how to choose partition methods, based on the graph's structure or some other properties? The best option is to try all the partition strategies (as well as the default,
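Trying each strategy is straightforward; a sketch using the GraphX PartitionStrategy names (assuming `graph` is an existing Graph):

```scala
import org.apache.spark.graphx.PartitionStrategy._

// Sketch: repartition the same graph under each strategy and benchmark the
// downstream computation to pick the best fit for the graph's structure.
val g2d        = graph.partitionBy(EdgePartition2D)
val g1d        = graph.partitionBy(EdgePartition1D)
val gRandom    = graph.partitionBy(RandomVertexCut)
val gCanonical = graph.partitionBy(CanonicalRandomVertexCut)
```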

Re: [GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-04-21 Thread Ankur Dave
On Fri, Apr 11, 2014 at 4:42 AM, Pierre-Alexandre Fonta pierre.alexandre.fonta+sp...@gmail.com wrote: Testing in mapTriplets if a vertex attribute, which is defined as Integer in first VertexRDD but has been changed after to Double by mapVertices, is greater than a number throws

Re: Do I need to learn Scala for spark ?

2014-04-21 Thread Pulasthi Supun Wickramasinghe
Hi, I think you can do just fine with your Java knowledge. There is a Java API that you can use [1]. I am also new to Spark and I have gotten by with just my Java knowledge. Scala is also easy to learn if you are good with Java. [1] http://spark.apache.org/docs/latest/java-programming-guide.html

custom kryoserializer class under mesos

2014-04-21 Thread Soren Macbeth
Hello, Is it possible to use a custom class as my spark's KryoSerializer running under Mesos? I've tried adding my jar containing the class to my spark context (via SparkConf.addJars), but I always get: java.lang.ClassNotFoundException: flambo.kryo.FlamboKryoSerializer at
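One likely wrinkle, stated as an assumption rather than a confirmed diagnosis: the serializer class is instantiated when executors start, before jars distributed via SparkConf.addJars are fetched, so the jar generally needs to be on the executor classpath up front. A sketch:

```scala
import org.apache.spark.SparkConf

// Sketch: name the custom serializer, and ensure its jar is already on every
// executor's classpath (e.g. via SPARK_CLASSPATH or the executor URI), since
// the serializer may be loaded before addJars-distributed jars arrive.
val conf = new SparkConf()
  .set("spark.serializer", "flambo.kryo.FlamboKryoSerializer")
```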

Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-21 Thread Sandy Ryza
Hi Christophe, Adding the jars to both SPARK_CLASSPATH and ADD_JARS is required. The former makes them available to the spark-shell driver process, and the latter tells Spark to make them available to the executor processes running on the cluster. -Sandy On Wed, Apr 16, 2014 at 9:27 AM,

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Marcelo Vanzin
Hi Sung, On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: The goal is to keep an intermediate value per row in memory, which would allow faster subsequent computations. I.e., computeSomething would depend on the previous value from the previous computation. I

spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Hi, all. I’m writing a Spark application to load S3 data into HDFS. The HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0. After I execute val allfiles = sc.textFile("s3n://abc/*.txt") val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset") Spark throws an exception:

checkpointing without streaming?

2014-04-21 Thread Diana Carroll
I'm trying to understand when I would want to checkpoint an RDD rather than just persist it to disk. Every reference to checkpoint I can find relates to Spark Streaming, but the method is defined in the core Spark library, not Streaming. Does it exist solely for streaming, or are there

Re: checkpointing without streaming?

2014-04-21 Thread Xiangrui Meng
Checkpoint clears dependencies. You might need checkpoint to cut a long lineage in iterative algorithms. -Xiangrui On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll dcarr...@cloudera.com wrote: I'm trying to understand when I would want to checkpoint an RDD rather than just persist to disk.
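A minimal sketch of that pattern in an iterative job (the checkpoint directory, `initialRanks`, `step`, and the every-10-iterations cadence are all assumptions for illustration):

```scala
// Sketch: periodically checkpoint inside an iterative loop so the lineage
// does not grow without bound across iterations.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // assumed path

var ranks = initialRanks                        // some starting RDD (assumed)
for (i <- 1 to 100) {
  ranks = step(ranks).cache()                   // step: one iteration's transforms
  if (i % 10 == 0) ranks.checkpoint()           // truncate the lineage periodically
  ranks.count()                                 // action to force materialization
}
```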

Re: Spark is slow

2014-04-21 Thread Marcelo Vanzin
Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote: And, I haven't gotten any answers to my questions. One thing that might explain that is that, at least for me, all (and I mean *all*) of your messages are ending up in my GMail spam folder, complaining that GMail can't

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Sung Hwan Chung
I would probably agree that it's typically not a good idea to add states to distributed systems. Additionally, from a purist's perspective, this would be a bit of hacking to the paradigm. However, from a practical point of view, I think that it's a reasonable trade-off between efficiency and

Re: Spark is slow

2014-04-21 Thread John Meagher
Yahoo made some changes that drive mailing list posts into spam folders: http://www.virusbtn.com/blog/2014/04_15.xml On Mon, Apr 21, 2014 at 2:50 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote: And, I haven't gotten any

Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
I'm trying to get my feet wet with Spark. I've done some simple stuff in the shell in standalone mode, and now I'm trying to connect to HDFS resources, but I'm running into a problem. I synced to git's master branch (c399baa - SPARK-1456 Remove view bounds on Ordered in favor of a context

Re: checkpointing without streaming?

2014-04-21 Thread Diana Carroll
When might that be necessary or useful? Presumably I can persist and replicate my RDD to avoid re-computation, if that's my goal. What advantage does checkpointing provide over disk persistence with replication? On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng men...@gmail.com wrote:

Re: Spark is slow

2014-04-21 Thread Nicholas Chammas
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my Spam folder. :( With regards to your questions, I would suggest in general adding some more technical detail to them. It will be difficult for people to give you suggestions if all they are told is "Spark is slow." How does

Re: Spark is slow

2014-04-21 Thread Sam Bessalah
Why don't you start by explaining what kind of operation you're running on Spark that's slower than Hadoop MapReduce? Maybe we could start there. And yes, this mailing list is very busy since many people are getting into Spark; it's hard to answer everyone. On 21 Apr 2014 20:23, Joe L selme...@yahoo.com

Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Nicholas Chammas
I'm looking to start experimenting with Spark Streaming, and I'd like to use Amazon Kinesis (https://aws.amazon.com/kinesis/) as my data source. Looking at the list of supported Spark Streaming sources (http://spark.apache.org/docs/latest/streaming-programming-guide.html#linking), I don't see any

Re: BFS implemented

2014-04-21 Thread Mayur Rustagi
Would be good if you can contribute this as an example; BFS is a common enough algorithm. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, Apr 19, 2014 at 4:16 AM, Ghufran Malik gooksma...@gmail.com wrote: Ahh nvm I found

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Matei Zaharia
There was a patch posted a few weeks ago (https://github.com/apache/spark/pull/223), but it needs a few changes in packaging because it uses a license that isn’t fully compatible with Apache. I’d like to get this merged when the changes are made though — it would be a good input source to

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread Tathagata Das
Are you by any chance starting two StreamingContexts in the same JVM? That could explain a lot of the weird mixing of data you are seeing. It's not a supported usage scenario to start multiple StreamingContexts simultaneously in the same JVM. TD On Thu, Apr 17, 2014 at 10:58 PM, gaganbm

Re: question about the SocketReceiver

2014-04-21 Thread Tathagata Das
As long as the socket server sends data through the same connection, the existing code is going to work. The socket.getInputStream returns a input stream which will continuously allow you to pull data sent over the connection. The bytesToObject function continuously reads data from the input

RE: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
I figured it out - I should be using textFile(...), not hadoopFile(...). And my HDFS URL should include the host: hdfs://host/user/kwilliams/corTable2/part-m-0 I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the
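A sketch of the working form (host and path follow the message above; note as an assumption worth verifying: with the Hadoop configuration directory on the classpath, the host-less `hdfs:///path` form may fall back to the namenode configured in hdfs-site.xml):

```scala
// Sketch: textFile reads text directly, while hadoopFile requires explicit
// InputFormat, key, and value type parameters.
val lines = sc.textFile("hdfs://host/user/kwilliams/corTable2")
lines.take(5).foreach(println)
```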

Re: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Marcelo Vanzin
Hi Ken, On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken ken.willi...@windlogics.com wrote: I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but that's not so important. Try adding

Re: checkpointing without streaming?

2014-04-21 Thread Tathagata Das
Diana, that is a good question. When you persist an RDD, the system still remembers the whole lineage of parent RDDs that created that RDD. If one of the executor fails, and the persist data is lost (both local disk and memory data will get lost), then the lineage is used to recreate the RDD. The

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Parviz Deyhim
it is possible Nick. Please take a look here: https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 the source code is here as a pull request: https://github.com/apache/spark/pull/223 let me know if you have any questions. On Mon, Apr 21, 2014 at 1:00 PM, Nicholas Chammas

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Parviz Deyhim
sorry Matei. Will definitely start working on making the changes soon :) On Mon, Apr 21, 2014 at 1:10 PM, Matei Zaharia matei.zaha...@gmail.com wrote: There was a patch posted a few weeks ago ( https://github.com/apache/spark/pull/223), but it needs a few changes in packaging because it uses

RE: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
-Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Hi Ken, On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken ken.willi...@windlogics.com wrote: I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the

ERROR TaskSchedulerImpl: Lost an executor

2014-04-21 Thread jaeholee
Hi, I am trying to set up my own standalone Spark, and I started the master node and worker nodes. Then I ran ./bin/spark-shell, and I get this message: 14/04/21 16:31:51 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): remote Akka client disassociated 14/04/21 16:31:51 ERROR

Re: Hung inserts?

2014-04-21 Thread Brad Heller
I tried removing the CLUSTERED directive and got the same results :( I also removed SORTED; same deal. I'm going to try removing partitioning altogether for now. On Mon, Apr 21, 2014 at 4:58 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Clustering is not supported. Can you remove that

Re: Hung inserts?

2014-04-21 Thread Brad Heller
So after a little more investigation, it turns out this issue happens specifically when I interact with Shark server. If I log in to the master and start a shark session (./bin/shark), everything works as expected. I'm starting Shark server with the following upstart script; am I doing something

Re: [ann] Spark-NYC Meetup

2014-04-21 Thread Sam Bessalah
Sounds great François. On 21 Apr 2014 22:31, François Le Lay f...@spotify.com wrote: Hi everyone, This is a quick email to announce the creation of a Spark-NYC Meetup. We have 2 upcoming events, one at PlaceIQ, another at Spotify where Reynold Xin (Databricks) and Christopher Johnson

Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Parviz Deyhim
I ran into the same issue. The problem seems to be with the jets3t library that Spark uses in project/SparkBuild.scala. Change this: net.java.dev.jets3t % jets3t % 0.7.1 to: net.java.dev.jets3t % jets3t % 0.9.0. 0.7.1 is not the right version of jets3t for Hadoop
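As an sbt dependency line in project/SparkBuild.scala, the change above would look roughly like this (quoting restored; the exact surrounding context in the build file may differ):

```scala
// Sketch of the dependency bump in project/SparkBuild.scala:
// "net.java.dev.jets3t" % "jets3t" % "0.7.1"   // old: breaks with Hadoop 2.3.0
"net.java.dev.jets3t" % "jets3t" % "0.9.0"      // new: compatible version
```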

Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Yes, I fixed it in the same way, but didn’t get a chance to get back here. I also made a PR: https://github.com/apache/spark/pull/468 Best, -- Nan Zhu On Monday, April 21, 2014 at 8:19 PM, Parviz Deyhim wrote: I ran into the same issue. The problem seems to be with the jets3t library

Adding to an RDD

2014-04-21 Thread Ian Ferreira
Feels like a silly question, but what if I wanted to apply a map to each element in an RDD, yet instead of replacing the value, add new columns holding the manipulated value? I.e. res0: Array[String] = Array(1 2, 1 3, 1 4, 2 1, 3 1, 4 1) becomes res0: Array[String] = Array(1 2 2 4, 1 3 1 6,
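A hedged sketch of one way to do this, assuming an RDD[String] of space-separated numbers (the appended column here is illustrative, not necessarily the exact derivation in the example above):

```scala
// Sketch: map each "a b" row to the original row plus a derived column,
// appending rather than replacing.
val widened = rows.map { s =>
  val Array(a, b) = s.split(" ").map(_.toInt)
  s"$s ${a * b}"   // original columns, then the product as a new column
}
```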

Re: Spark is slow

2014-04-21 Thread Joe L
g1 = pairs1.groupByKey().count() pairs1 = pairs1.groupByKey(g1).cache() g2 = triples.groupByKey().count() pairs2 = pairs2.groupByKey(g2) pairs = pairs2.join(pairs1) Hi, I want to implement hash-partitioned joining as shown above. But somehow, it is taking so long to perform. As I
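Note that groupByKey's optional argument is a partition count, so passing a key count obtained from count(), as above, is probably not what was intended. A hedged sketch of a hash-partitioned join that avoids re-shuffling either side:

```scala
import org.apache.spark.HashPartitioner

// Sketch: co-partition both inputs with the same partitioner, then join;
// the join can then reuse that partitioning instead of shuffling again.
val p      = new HashPartitioner(64)   // partition count is illustrative
val left   = pairs1.partitionBy(p).cache()
val right  = pairs2.partitionBy(p)
val joined = left.join(right)
```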

two calls of saveAsTextFile() have different results on the same RDD

2014-04-21 Thread randylu
I just call saveAsTextFile() twice. 'doc_topic_dist' is of type RDD[(Long, Array[Int])]; each element is a pair of (doc, topic_arr). For the same doc, the two files contain different topic_arr values. ... doc_topic_dist.coalesce(1, true).saveAsTextFile(save_path)

how to solve this problem?

2014-04-21 Thread gogototo
14/04/22 10:43:45 WARN scheduler.TaskSetManager: Loss was due to java.util.NoSuchElementException java.util.NoSuchElementException: End of stream at org.apache.spark.util.NextIterator.next(NextIterator.scala:83) at

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-21 Thread randylu
It's OK when I call doc_topic_dist.cache() first.
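This is consistent with how actions work: without persistence, each saveAsTextFile recomputes the whole lineage from scratch, so any nondeterministic step upstream (random initialization, unordered aggregation) can produce different results each time. A sketch, with placeholder output paths:

```scala
// Sketch: cache() pins one materialization of the RDD, so both saves write
// the same data instead of triggering two independent recomputations.
doc_topic_dist.cache()
doc_topic_dist.coalesce(1, true).saveAsTextFile("hdfs:///out/run_a")  // assumed paths
doc_topic_dist.coalesce(1, true).saveAsTextFile("hdfs:///out/run_b")
```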

Re: Adding to an RDD

2014-04-21 Thread Mark Hamstra
As long as the function that you are mapping over the RDD is pure, preserving referential transparency so that anytime you map the same function over the same initial RDD elements you get the same result elements, then there is no problem in doing what you suggest. In fact, it's common practice.

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread gaganbm
Yes. I am running this in local mode and the SSCs run in the same JVM. So, if I deploy this on a cluster, would such behavior go away? Also, is there any way I can start the SSCs on a local machine but in different JVMs? I couldn't find anything about this in the documentation. The

Need clarification of joining streams

2014-04-21 Thread gaganbm
I wanted some clarification on the behavior of join streams. As I believe, the join works per batch. I am reading data from two Kafka streams and then joining them based on some keys. But what happens if one stream hasn't produced any data in that batch duration, and the other has some ? Or lets
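A hedged sketch of the per-batch semantics (stream names assumed): join keeps only keys present in both streams' current batch, while leftOuterJoin retains unmatched keys from the left stream:

```scala
// Sketch: DStream joins operate batch-by-batch. A key that appears in only
// one stream's batch is dropped by join; leftOuterJoin keeps it with None.
val inner = stream1.join(stream2)           // (K, (V1, V2)) for matched keys
val left  = stream1.leftOuterJoin(stream2)  // (K, (V1, Option[V2]))
```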

Re: Spark recovery from bad nodes

2014-04-21 Thread Praveen R
Please check my comment on the shark-users thread: https://groups.google.com/forum/#!searchin/shark-users/Failure$20recovery$20in$20Shark$20when$20cluster$20/shark-users/vUUGLZANxr8/MMCtKhqjhLMJ . On Tue, Apr 22, 2014 at 8:06 AM, rama0120 lakshminaarayana...@gmail.com wrote: Hi, I couldn't find