Re: diamond dependency tree

2014-09-19 Thread Victor Tso-Guillen
Yes, sorry, I meant DAG. I fixed it in my message but not in the subject. The terminology of 'leaf' wasn't helpful, I know, so hopefully my visual example was enough. Anyway, I noticed what you said in a local-mode test. I can try that in a cluster, too. Thank you! On Thu, Sep 18, 2014 at 10:28 PM,

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-19 Thread Evan Chan
Sweet, that's probably it. Too bad it didn't seem to make 1.1? On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust mich...@databricks.com wrote: The unknown slowdown might be addressed by https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd On Sun, Sep 14, 2014 at

Re: Example of Geoprocessing with Spark

2014-09-19 Thread Evan Chan
Hi Abel, Pretty interesting. May I ask how big your point CSV dataset is? It seems you are relying on a linear search through the FeatureCollection of polygons to find which one intersects your point. This is going to be extremely slow. I highly recommend using a SpatialIndex, such as the many that
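
For reference, a hedged sketch of the spatial-index idea using the JTS STRtree (2014-era releases live under com.vividsolutions.jts); the polygons collection and the query coordinates are assumptions for illustration:

    import com.vividsolutions.jts.geom.{Coordinate, Geometry, GeometryFactory}
    import com.vividsolutions.jts.index.strtree.STRtree
    import scala.collection.JavaConverters._

    // Build the index once over all polygons (polygons: Seq[Geometry], assumed).
    val index = new STRtree()
    polygons.foreach(p => index.insert(p.getEnvelopeInternal, p))

    val point = new GeometryFactory().createPoint(new Coordinate(-99.1, 19.4))
    // The envelope query narrows to a few candidates; exact intersects() confirms.
    val hit = index.query(point.getEnvelopeInternal).asScala
      .map(_.asInstanceOf[Geometry]).find(_.intersects(point))

Building the index once on the driver and broadcasting it to the executors avoids repeating the per-point linear scan in every task.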

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-19 Thread mohan.gadm
Thanks for the info, Frank. Twitter's chill-avro serializer looks great. But how does Spark identify it as a serializer, as it doesn't extend KryoSerializer? (Sorry, Scala is an alien language for me.) - Thanks Regards, Mohan -- View this message in context:
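
For reference, a hedged sketch of how a custom Kryo serializer is usually wired into Spark 1.x; the class names are illustrative assumptions, not chill-avro's actual API:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Serializers extend Kryo's own Serializer class, not Spark's KryoSerializer;
    // Spark simply instantiates the registrator named in the config below.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // e.g. kryo.register(classOf[MyAvroRecord], new MyAvroSerializer)  // assumed names
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")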

Time difference between Python and Scala

2014-09-19 Thread Luis Guerra
Hello everyone, What should be the normal time difference between Scala and Python using Spark? I mean running the same program in the same cluster environment. In my case I am using numpy array structures for the Python code and vectors for the Scala code, both for handling my data. The time

Powered By Spark

2014-09-19 Thread Alexander Albul
Hello! Could you please add us to your powered by page? Project name: Ubix.io Link: http://ubix.io Components: Spark, Shark, Spark SQL, MLlib, GraphX, Spark Streaming, Adam project Description: blank for now

Re: Odd error when using a rdd map within a stream map

2014-09-19 Thread Filip Andrei
Hey, I don't think that's the issue: foreach is called on 'results', which is a DStream of floats, so naturally it passes RDDs to its function. And either way, changing the code in the first mapper to comment out the map-reduce process on the RDD: Float f = 1.0f; // nnRdd.map(new Function<NeuralNet,

basic streaming question

2014-09-19 Thread motte1988
Hello everybody, I'm new to Spark Streaming and played around a bit with WordCount and a PageRank algorithm in a cluster environment. Am I right that in the cluster each executor computes its part of the data stream separately? And that the result of each executor is independent of the other executors? In the

Re: Unable to find proto buffer class error with RDD<protobuf>

2014-09-19 Thread Paul Wais
Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just making Kryo use the JavaSerializer for my messages. The resulting stack trace made it look like protobuf GeneratedMessageLite is actually using the
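
A hedged sketch of the workaround described here: registering Kryo's built-in JavaSerializer for a generated message class, inside a Spark 1.x KryoRegistrator. MyProtoMessage is a hypothetical stand-in for an actual protobuf-generated class.

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.serializers.JavaSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // MyProtoMessage is a hypothetical protobuf-generated class; protobuf
        // messages implement java.io.Serializable, so Java serde can handle them.
        kryo.register(classOf[MyProtoMessage], new JavaSerializer())
      }
    }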

Re: Unable to find proto buffer class error with RDD<protobuf>

2014-09-19 Thread Paul Wais
Derp, one caveat to my solution: I guess Spark doesn't use Kryo for Function serde :( On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais pw...@yelp.com wrote: Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just

rsync problem

2014-09-19 Thread rapelly kartheek
Hi, I'd made some modifications to the Spark source code on the master and propagated them to the slaves using rsync. I followed this command: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory. This worked perfectly. But, I wanted to simultaneously

Re: rsync problem

2014-09-19 Thread Tobias Pfeiffer
Hi, On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: This worked perfectly. But, I wanted to simultaneously rsync all the slaves. So, added the other slaves as following: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname

Re: rsync problem

2014-09-19 Thread rapelly kartheek
Hi Tobias, I've copied the files from master to all the slaves. On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: This worked perfectly. But, I wanted to simultaneously rsync all

Re: rsync problem

2014-09-19 Thread rapelly kartheek
*you have copied a lot of files from various hosts to username@slave3:path* -- only from one node to all the other nodes... On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi Tobias, I've copied the files from master to all the slaves. On Fri, Sep 19, 2014

Re: Spark Streaming and ReactiveMongo

2014-09-19 Thread t1ny
Here's what we've tried so far as a first example of a custom Mongo receiver: class MongoStreamReceiver(host: String) extends NetworkReceiver[String] { protected lazy val blocksGenerator: BlockGenerator = new BlockGenerator(StorageLevel.MEMORY_AND_DISK_SER_2) protected def onStart()

Fwd: rsync problem

2014-09-19 Thread rapelly kartheek
-- Forwarded message -- From: rapelly kartheek kartheek.m...@gmail.com Date: Fri, Sep 19, 2014 at 1:51 PM Subject: Re: rsync problem To: Tobias Pfeiffer t...@preferred.jp any idea why the cluster is dying down??? On Fri, Sep 19, 2014 at 1:47 PM, rapelly kartheek

Re: Spark + Mahout

2014-09-19 Thread Sean Owen
No, it is actually a quite different 'alpha' project under the same name: linear algebra DSL on top of H2O and also Spark. It is not really about algorithm implementations now. On Sep 19, 2014 1:25 AM, Matthew Farrellee m...@redhat.com wrote: On 09/18/2014 05:40 PM, Sean Owen wrote: No, the

Re: PairRDD's lookup method Performance

2014-09-19 Thread Sean Owen
The product of each mapPartitions call can be an Iterable of one big Map. You still need to write some extra custom code, like what lookup() does, to exploit this data structure. On Sep 18, 2014 11:07 PM, Harsha HN 99harsha.h@gmail.com wrote: Hi All, My question is related to improving
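
A hedged sketch of that idea against the Spark 1.1 API; the pair RDD, its types, and the partition count are assumptions:

    import org.apache.spark.HashPartitioner

    val pairs = data.partitionBy(new HashPartitioner(16))   // data: RDD[(Int, String)], assumed
    // One big Map per partition; keep the partitioner so keys stay locatable.
    val maps = pairs.mapPartitions(it => Iterator(it.toMap), preservesPartitioning = true).cache()

    // Probe only the partition that owns the key, as lookup() does internally in 1.x.
    def fastLookup(key: Int): Option[String] = {
      val p = maps.partitioner.get.getPartition(key)
      sc.runJob(maps, (it: Iterator[Map[Int, String]]) => it.next().get(key),
        Seq(p), allowLocal = false).head
    }

After the maps RDD is cached, each fastLookup runs a single-partition job and a hash probe instead of scanning every element.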

Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Hi, Is there a way to bulk-load to HBase from an RDD? HBase offers the HFileOutputFormat class for bulk loading from a MapReduce job, but I cannot figure out how to use it with saveAsHadoopDataset. Thanks.

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Hi, After reading several documents, it seems that saveAsHadoopDataset cannot use HFileOutputFormat. It's because the saveAsHadoopDataset method uses JobConf, so it belongs to the old Hadoop API, while HFileOutputFormat is a member of the mapreduce package, which is for the new Hadoop API. Am I

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Hi, Sorry, I just found saveAsNewAPIHadoopDataset. Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there any example code for that? Thanks. From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] Sent: Friday, September 19, 2014 8:18 PM To:

Re: Spark + Mahout

2014-09-19 Thread Matthew Farrellee
On 09/19/2014 05:06 AM, Sean Owen wrote: No, it is actually a quite different 'alpha' project under the same name: linear algebra DSL on top of H2O and also Spark. It is not really about algorithm implementations now. On Sep 19, 2014 1:25 AM, Matthew Farrellee m...@redhat.com

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Thank you for the example code. Currently I use foreachPartition() + Put(), but your example code can be used to clean up my code. BTW, since data uploaded by Put() goes through the normal HBase write path, it can be slow. So, it would be nice if bulk-load could be used, since it
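
For reference, a hedged sketch of the foreachPartition() + Put() pattern mentioned here (HBase 0.98-era client API; the table, column family, and qualifier names are assumptions). Each Put travels the normal write path:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    rdd.foreachPartition { rows =>                  // rdd: RDD[(String, String)], assumed
      // One connection per partition, opened on the executor.
      val table = new HTable(HBaseConfiguration.create(), "my_table")
      rows.foreach { case (key, value) =>
        val put = new Put(Bytes.toBytes(key))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()                                 // flushes any buffered puts
    }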

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
In fact, it seems that Put can be used by HFileOutputFormat, so the Put object itself may not be the problem. The problem is that TableOutputFormat uses the Put object in the normal way (which goes through the normal write path), while HFileOutputFormat uses it to directly build HFiles. From:

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-19 Thread Aniket Bhatnagar
Apologies for the delay in getting back on this. It seems the Kinesis example does not run on Spark 1.1.0 even when it is built using the kinesis-asl profile, because of a dependency conflict in the HTTP client (same issue as

Re: Bulk-load to HBase

2014-09-19 Thread Aniket Bhatnagar
Agreed that the bulk import would be faster. In my case, I wasn't expecting a lot of data to be uploaded to HBase and also, I didn't want to take the pain of importing generated HFiles into HBase. Is there a way to invoke HBase HFile import batch script programmatically? On 19 September 2014
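
One possible answer, assuming the HBase 0.98-era client API: LoadIncrementalHFiles is the programmatic counterpart of the completebulkload tool. The path and table name below are assumptions:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

    val conf = HBaseConfiguration.create()
    // Moves the generated HFiles into the table's regions, splitting if needed.
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), new HTable(conf, "my_table"))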

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-19 Thread Frank Austin Nothaft
Hi Mohan, It’s a bit convoluted to follow in their source, but they essentially typedef KSerializer as being a KryoSerializer, and then their serializers all extend KSerializer. Spark should identify them properly as Kryo Serializers, but I haven’t tried it myself. Regards, Frank Austin

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-19 Thread RJ Nowling
Jatin, If you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it. RJ On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: Hi Jatin, HashingTF should be able to solve the memory problem if you use a small feature dimension in

Anyone have successful recipe for spark cassandra connector?

2014-09-19 Thread gzoller
I'm running out of options trying to integrate cassandra, spark, and the spark-cassandra-connector. I quickly found out just grabbing the latest versions of everything (drivers, etc.) doesn't work--binary incompatibilities it would seem. So last I tried using versions of drivers from the

Re: Support R in Spark

2014-09-19 Thread oppokui
Thanks, Shivaram. Kui On Sep 19, 2014, at 12:58 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: As R is single-threaded, SparkR launches one R process per-executor on the worker side. Thanks Shivaram On Thu, Sep 18, 2014 at 7:49 AM, oppokui oppo...@gmail.com wrote:

Re: Spark Streaming and ReactiveMongo

2014-09-19 Thread Soumitra Kumar
onStart should be non-blocking. You may try to create a thread in onStart instead. - Original Message - From: t1ny wbr...@gmail.com To: u...@spark.incubator.apache.org Sent: Friday, September 19, 2014 1:26:42 AM Subject: Re: Spark Streaming and ReactiveMongo Here's what we've tried so
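
A hedged sketch of that advice against the Spark 1.x receiver API (t1ny's code uses the older NetworkReceiver): onStart() only spawns a thread and returns, and the thread does the blocking reads. The readOne() helper is a placeholder for the actual ReactiveMongo query.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MongoReceiver(host: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

      def onStart(): Unit = {
        new Thread("mongo-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(readOne())       // store() hands each record to Spark
            }
          }
        }.start()                    // onStart() returns immediately
      }

      def onStop(): Unit = {}        // the loop exits once isStopped() turns true

      private def readOne(): String = ???  // placeholder for the Mongo read
    }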

Re: rsync problem

2014-09-19 Thread rapelly kartheek
Thank you Soumya Simanta and Tobias. I've deleted the contents of the work folder on all the nodes. Now it's working perfectly, as it was before. Thank you Karthik On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta soumya.sima...@gmail.com wrote: One possible reason is maybe that the checkpointing

Re: Cannot run SimpleApp as regular Java app

2014-09-19 Thread ericacm
It turns out that it was the Hadoop version that was the issue. spark-1.0.2-hadoop1 and spark-1.1.0-hadoop1 both work. spark-1.0.2-hadoop2 and spark-1.1.0-hadoop2.4 do not work. It's strange because for this little test I am not even using HDFS at all. -- Eric On Thu,

Re: paging through an RDD that's too large to collect() all at once

2014-09-19 Thread Dave Anderson
Excellent - that's exactly what I needed. I saw iterator() but missed the toLocalIterator() method -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/paging-through-an-RDD-that-s-too-large-to-collect-all-at-once-tp14638p14686.html Sent from the Apache Spark
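
For anyone else landing here, a minimal sketch of the paging pattern (Spark 1.x; the page size is an assumption): toLocalIterator materializes only one partition on the driver at a time, and grouped() then yields fixed-size pages.

    rdd.toLocalIterator.grouped(1000).foreach { page =>
      page.foreach(println)   // process one page of up to 1000 elements at a time
    }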

return probability \ confidence instead of actual class

2014-09-19 Thread Adamantios Corais
Hi, I am working with the SVMWithSGD classification algorithm in Spark. It works fine for me; however, I would like to distinguish the instances that are classified with high confidence from those with low confidence. How do we define the threshold here? Ultimately, I want to keep only those for which
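
A hedged sketch of one way to get at this with the MLlib 1.x API: clearing the model's threshold makes predict() return the raw SVM margin rather than a 0/1 label, so you can impose your own confidence cutoff. The cutoff value 1.0 and the training/test RDDs are assumptions.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    val model = SVMWithSGD.train(training, 100)   // training: RDD[LabeledPoint], assumed
    model.clearThreshold()                        // predict() now returns raw margins

    // Keep only predictions whose margin magnitude clears the (assumed) cutoff.
    val confident = test.map(p => (model.predict(p.features), p.label))
      .filter { case (margin, _) => math.abs(margin) > 1.0 }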

Spark 1.1.0 (w/ hadoop 2.4) versus aws-java-sdk-1.7.2.jar

2014-09-19 Thread tian zhang
Hi, Spark experts, I have the following issue when using the AWS Java SDK in my Spark application. I narrowed it down to the following steps to reproduce the problem: 1) I have Spark 1.1.0 with Hadoop 2.4 installed on a 3-node cluster 2) from the master node, I did the following steps: spark-shell

Re: spark-submit command-line with --files

2014-09-19 Thread Andrew Or
Hi Chinchu, SparkEnv is an internal class that is only meant to be used within Spark. Outside of Spark, it will be null because there are no executors or driver to start an environment for. Similarly, SparkFiles is meant to be used internally (though its privacy settings should be modified to

Re: Time difference between Python and Scala

2014-09-19 Thread Davies Liu
I think it's normal. On Fri, Sep 19, 2014 at 12:07 AM, Luis Guerra luispelay...@gmail.com wrote: Hello everyone, What should be the normal time difference between Scala and Python using Spark? I mean running the same program in the same cluster environment. In my case I am using numpy array

RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Andy Davidson
Hi, I wrote a little Java job to try to figure out how RDD pipe works. Below is my test shell script. If I turn on debugging in the script, I get output in my console. If debugging is turned off in the shell script, I do not see anything in my console. Is this a bug or a feature? I am running
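
For context, a minimal sketch of what pipe() does (Scala API; the script path is an assumption): each RDD element is written to the subprocess's stdin, and each line the subprocess prints to stdout becomes an element of the result RDD. Anything the script sends to stderr does not come back through the RDD.

    val piped = sc.parallelize(Seq("a", "b", "c")).pipe("/path/to/test.sh")
    piped.collect().foreach(println)   // lines emitted on test.sh's stdout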

mllib performance on mesos cluster

2014-09-19 Thread SK
Hi, I have a program similar to the BinaryClassifier example that I am running using my data (which is fairly small). I run this for 100 iterations. I observed the following performance: Standalone mode cluster with 10 nodes (with Spark 1.0.2): 5 minutes Standalone mode cluster with 10 nodes

Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Sean Owen
What is in 'rdd' here, to double check? Do you mean the spark shell when you say console? At the end you're grepping output from some redirected output but where is that from? On Sep 19, 2014 7:21 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am wrote a little java job to try and

Re: spark-submit command-line with --files

2014-09-19 Thread Andrew Or
Hey just a minor clarification, you _can_ use SparkFiles.get in your application only if it runs on the executors, e.g. in the following way: sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect() But not in general (otherwise NPE, as in your case). Perhaps this should be

Spark Streaming compilation error: algebird not a member of package com.twitter

2014-09-19 Thread SK
Hi, I am using the latest release Spark 1.1.0. I am trying to build the streaming examples (under examples/streaming) as a standalone project with the following streaming.sbt file. When I run sbt assembly, I get an error stating that object algebird is not a member of package com.twitter. I
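
The usual fix is declaring algebird as a dependency in streaming.sbt; a hedged sketch (the algebird version is an assumption, pick one that matches your Scala version):

    // streaming.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
      "com.twitter" %% "algebird-core" % "0.7.0"   // assumed version
    )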

Re: Example of Geoprocessing with Spark

2014-09-19 Thread Abel Coronado Iruegas
Hi Evan, here is an improved version, thanks for your advice. But you know, the last step, the saveAsTextFile, is very slow :( import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import java.net.URL import java.text.SimpleDateFormat import

Re: Bulk-load to HBase

2014-09-19 Thread Soumitra Kumar
I successfully did this once. Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)], then: val conf = HBaseConfiguration.create() val job = new Job(conf, "CEF2HFile") job.setMapOutputKeyClass(classOf[ImmutableBytesWritable]); job.setMapOutputValueClass(classOf[KeyValue]); val table = new
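
A hedged completion of this recipe (HBase 0.98-era API, Spark 1.x; the table name and output path are assumptions, and the RDD must be sorted by row key before writing HFiles):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.KeyValue
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
    import org.apache.hadoop.mapreduce.Job

    val conf = HBaseConfiguration.create()
    val job = new Job(conf, "CEF2HFile")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    // Configures compression, block encoding, and the total-order partitioning.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "my_table"))

    kvRdd.saveAsNewAPIHadoopFile(          // kvRdd: RDD[(ImmutableBytesWritable, KeyValue)]
      "/tmp/hfiles", classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat], job.getConfiguration)

    // Then import the HFiles with the LoadIncrementalHFiles.doBulkLoad() call
    // sketched earlier in this digest.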

Re: Failed running Spark ALS

2014-09-19 Thread Nick Pentreath
Have you set spark.local.dir (I think this is the config setting)? It needs to point to a volume with plenty of space. By default, if I recall, it points to /tmp. Sent from my iPhone On 19 Sep 2014, at 23:35, jw.cmu jinliangw...@gmail.com wrote: I'm trying to run Spark ALS using the netflix
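
A minimal sketch of that setting (the path is an assumption; a comma-separated list of directories also works, spreading shuffle spill across disks):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.local.dir", "/mnt/spark-tmp")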

Re: Better way to process large image data set ?

2014-09-19 Thread Evan Chan
What Sean said. You should also definitely turn on Kryo serialization. The default Java serialization is really, really slow if you're going to move around lots of data. Also make sure you use a cluster with high network bandwidth. On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen so...@cloudera.com