Re: spark-submit command-line with --files

2014-09-19 Thread chinchu
Thanks Andrew, that helps. On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List] < ml-node+s1001560n14708...@n3.nabble.com> wrote: > Hey, just a minor clarification: you _can_ use SparkFiles.get in your > application, but only if it runs on the executors, e.g. in the following way: >

How to Exclude Spark Dependencies from spark-streaming-kafka?

2014-09-19 Thread Ji ZHANG
Hi, I'm developing an application with spark-streaming-kafka, which depends on spark-streaming and kafka. Since spark-streaming is provided at runtime, I want to exclude its jars from the assembly. I tried the following configuration: libraryDependencies ++= { val sparkVersion = "1.0.2" Seq(
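A minimal sketch of the intended configuration: marking spark-streaming as "provided" keeps it out of the assembly, and the transitive copy pulled in by the Kafka module can be excluded explicitly (version numbers and the Scala suffix are illustrative):

    libraryDependencies ++= {
      val sparkVersion = "1.0.2"
      Seq(
        // "provided" keeps spark-streaming itself out of the assembly jar
        "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
        // exclude the transitive spark-streaming dependency of the Kafka module
        ("org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
          .exclude("org.apache.spark", "spark-streaming_2.10")
      )
    }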

Re: Reproducing the function of a Hadoop Reducer

2014-09-19 Thread Victor Tso-Guillen
So sorry about teasing you with the Scala. But the method is there in Java too, I just checked. On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen wrote: > It might not be the same as a real Hadoop reducer, but I think it would > accomplish the same. Take a look at: > > import org.apache.spark.

Spark Streaming: Calculate PV/UV by Minute and by Day?

2014-09-19 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.0. Say I have a website click stream source like the following: ('2014-09-19 00:00:00', '192.168.1.1', 'home_page') ('2014-09-19 00:00:01', '192.168.1.2', 'list_page') ... And I want to calculate the page views (PV, number of logs) and unique users (UV, identi
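One rough way to get per-minute PV and UV, assuming the stream has already been parsed into (time, ip, page) tuples (daily figures would additionally need updateStateByKey or a 24-hour window):

    // events: DStream[(String, String, String)] -- (time, ip, page), an assumption
    // PV: number of log lines per minute
    val pv = events
      .map { case (time, ip, page) => (time.substring(0, 16), 1L) } // key = "yyyy-MM-dd HH:mm"
      .reduceByKey(_ + _)

    // UV: number of distinct IPs per minute
    val uv = events
      .map { case (time, ip, page) => (time.substring(0, 16), ip) }
      .transform(_.distinct())
      .map { case (minute, _) => (minute, 1L) }
      .reduceByKey(_ + _)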

Re: Better way to process large image data set ?

2014-09-19 Thread Evan Chan
What Sean said. You should also definitely turn on Kryo serialization. The default Java serialization is really, really slow if you're gonna move around lots of data. Also make sure you use a cluster with high network bandwidth. On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen wrote: > Base 64 i
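Turning Kryo on is a two-line change on the SparkConf (a sketch; the app name is illustrative, and registering your own classes is optional but avoids embedding full class names in the serialized data):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("image-processing")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)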

Re: Failed running Spark ALS

2014-09-19 Thread Nick Pentreath
Have you set spark.local.dir (I think this is the config setting)? It needs to point to a volume with plenty of space. By default, if I recall, it points to /tmp. Sent from my iPhone > On 19 Sep 2014, at 23:35, "jw.cmu" wrote: > > I'm trying to run Spark ALS using the Netflix dataset but failed d
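For reference, a sketch of pointing spark.local.dir (where shuffle files and spills land) at a larger volume; the path is illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/big-volume/spark-tmp") // comma-separate multiple volumes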

Re: Bulk-load to HBase

2014-09-19 Thread Soumitra Kumar
I successfully did this once. Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)], then: val conf = HBaseConfiguration.create() val job = new Job(conf, "CEF2HFile") job.setMapOutputKeyClass(classOf[ImmutableBytesWritable]); job.setMapOutputValueClass(classOf[KeyValue]); val table = new HTable(con
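Filling out that outline (a sketch against Spark 1.x / HBase 0.98-era APIs; the table name and output path are illustrative, and rdd is assumed to be a key-sorted RDD[(ImmutableBytesWritable, KeyValue)]):

    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._

    val conf = HBaseConfiguration.create()
    val job = new Job(conf, "CEF2HFile")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    val table = new HTable(conf, "my_table")
    // configures compression, block size and total-order partitioning to match the table's regions
    HFileOutputFormat.configureIncrementalLoad(job, table)
    rdd.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
      classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration)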

Re: Example of Geoprocessing with Spark

2014-09-19 Thread Abel Coronado Iruegas
Hi Evan, here is an improved version, thanks for your advice. But you know, the last step, saveAsTextFile, is very slow :( import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import java.net.URL import java.text.SimpleDateFormat import c

Spark Streaming compilation error: algebird not a member of package com.twitter

2014-09-19 Thread SK
Hi, I am using the latest release, Spark 1.1.0. I am trying to build the streaming examples (under examples/streaming) as a standalone project with the following streaming.sbt file. When I run sbt assembly, I get an error stating that object algebird is not a member of package com.twitter. I tried
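The usual fix is declaring algebird explicitly in the sbt file, since it is not part of spark-streaming itself (a sketch; versions are illustrative):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
      "com.twitter" %% "algebird-core" % "0.7.0" // provides package com.twitter.algebird
    )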

Re: spark-submit command-line with --files

2014-09-19 Thread Andrew Or
Hey, just a minor clarification: you _can_ use SparkFiles.get in your application, but only if it runs on the executors, e.g. in the following way: sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect() But not in general (otherwise an NPE, as in your case). Perhaps this should be docum
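Putting the two halves together (a sketch; my.file is shipped with --files and resolved on the executors, never on the driver):

    // submitted with: spark-submit --files /local/path/to/my.file ...
    import org.apache.spark.SparkFiles
    import scala.io.Source

    val firstLines = sc.parallelize(1 to 4).map { _ =>
      // SparkFiles.get works here because this closure runs on an executor
      Source.fromFile(SparkFiles.get("my.file")).getLines().next()
    }.collect()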

Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Jey Kottalam
Your proposed use of rdd.pipe("foo") to communicate with an external process seems fine. The "foo" program should read its input from stdin, perform its computations, and write its results back to stdout. Note that "foo" will be run on the workers, invoked once per partition, and the result will be
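A minimal sketch of that contract (foo.sh is an illustrative stand-in; one stdin line per element in, one stdout line per result out):

    val piped = sc.parallelize(Seq("a", "b", "c"))
      .pipe("/path/to/foo.sh") // launched once per partition on the workers
    piped.collect().foreach(println) // pipe() returns the child's stdout lines as an RDD[String]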

Re: Unable to load app logs for MLLib programs in history server

2014-09-19 Thread SK
I have created JIRA ticket 3610 for the issue. Thanks.

Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Andy Davidson
Hi Jey, Many thanks for the code example. Here is what I really want to do: I want to use Spark Streaming and Python. Unfortunately pySpark does not support streams yet. It was suggested that the way to work around this was to use an RDD pipe. The example below was a little experiment. You can think of m

Failed running Spark ALS

2014-09-19 Thread jw.cmu
I'm trying to run Spark ALS using the Netflix dataset but failed due to a "No space on device" exception. It seems the exception is thrown after the training phase. It's not clear to me what is being written and where the output directory is. I was able to run the same code on the provided test.data

Re: Bulk-load to HBase

2014-09-19 Thread Ted Yu
Please see http://hbase.apache.org/book.html#completebulkload LoadIncrementalHFiles has a main() method. On Fri, Sep 19, 2014 at 5:41 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Agreed that the bulk import would be faster. In my case, I wasn't > expecting a lot of data to be upl
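On the programmatic-invocation question from the quoted message: LoadIncrementalHFiles can also be called directly rather than via its main() (a sketch against HBase 0.98-era APIs; path and table name are illustrative):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

    val conf = HBaseConfiguration.create()
    // moves the HFiles produced by HFileOutputFormat into the table's regions
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), new HTable(conf, "my_table"))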

Re: Reproducing the function of a Hadoop Reducer

2014-09-19 Thread Victor Tso-Guillen
It might not be the same as a real Hadoop reducer, but I think it would accomplish the same. Take a look at: import org.apache.spark.SparkContext._ // val rdd: RDD[(K, V)] // def zero(value: V): S // def reduce(agg: S, value: V): S // def merge(agg1: S, agg2: S): S val reducedUnsorted: RDD[(K, S)]
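The completed shape of that snippet would be roughly the following, using combineByKey (a sketch; zero/reduce/merge are the user-supplied functions from the comments):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // val rdd: RDD[(K, V)]
    // def zero(value: V): S           -- build the aggregate from the first value
    // def reduce(agg: S, value: V): S -- fold one more value into an aggregate
    // def merge(agg1: S, agg2: S): S  -- combine per-partition aggregates
    val reducedUnsorted: RDD[(K, S)] = rdd.combineByKey(zero, reduce, merge)
    val reduced = reducedUnsorted.sortByKey() // only if reducer-style key ordering matters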

Re: Problem with giving memory to executors on YARN

2014-09-19 Thread Vipul Pandey
How many cores do you have in your boxes? It looks like you are assigning 32 cores "per" executor - is that what you want? Are there other applications running on the cluster? You might want to check the YARN UI to see how many containers are getting allocated to your application. On Sep 19, 2014, a

Problem with giving memory to executors on YARN

2014-09-19 Thread Soumya Simanta
I'm launching a Spark shell with the following parameters: ./spark-shell --master yarn-client --executor-memory 32g --driver-memory 4g --executor-cores 32 --num-executors 8 but when I look at the Spark UI it shows only 209.3 GB total memory. Executors (10) - Memory: 55.9 GB Used (209.3 GB

Reproducing the function of a Hadoop Reducer

2014-09-19 Thread Steve Lewis
I am struggling to reproduce the functionality of a Hadoop reducer on Spark (in Java). In Hadoop I have a function public void doReduce(K key, Iterator values); in Hadoop there is also a consumer (context.write) which can be seen as consume(key, value). In my code: 1) knowing the key is important to

Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Jey Kottalam
Hi Andy, That's a feature -- you'll have to print out the return value from collect() if you want the contents to show up on stdout. Probably something like this: for(Iterator iter = rdd.pipe(pwd + "/src/main/bin/RDDPipe.sh").collect().iterator(); iter.hasNext();) System.out.println(iter.next

Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Sean Owen
What is in 'rdd' here, to double check? Do you mean the spark shell when you say console? At the end you're grepping output from some redirected output but where is that from? On Sep 19, 2014 7:21 PM, "Andy Davidson" wrote: > Hi > > I am wrote a little java job to try and figure out how RDD pipe

mllib performance on mesos cluster

2014-09-19 Thread SK
Hi, I have a program similar to the BinaryClassifier example that I am running using my data (which is fairly small). I run this for 100 iterations. I observed the following performance: Standalone mode cluster with 10 nodes (with Spark 1.0.2): 5 minutes Standalone mode cluster with 10 nodes (wi

RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Andy Davidson
Hi, I wrote a little Java job to try and figure out how RDD pipe works. Below is my test shell script. If I turn on debugging in the script, I get output in my console. If debugging is turned off in the shell script, I do not see anything in my console. Is this a bug or a feature? I am running t

Re: Time difference between Python and Scala

2014-09-19 Thread Davies Liu
I think it's normal. On Fri, Sep 19, 2014 at 12:07 AM, Luis Guerra wrote: > Hello everyone, > > What should be the normal time difference between Scala and Python using > Spark? I mean running the same program in the same cluster environment. > > In my case I am using numpy array structures for t

Re: spark-submit command-line with --files

2014-09-19 Thread Andrew Or
Hi Chinchu, SparkEnv is an internal class that is only meant to be used within Spark. Outside of Spark, it will be null because there are no executors or driver to start an environment for. Similarly, SparkFiles is meant to be used internally (though its privacy settings should be modified to ref

Spark 1.1.0 (w/ hadoop 2.4) versus aws-java-sdk-1.7.2.jar

2014-09-19 Thread tian zhang
Hi, Spark experts, I have the following issue when using the aws java sdk in my spark application. Here I narrowed it down to the following steps to reproduce the problem: 1) I have Spark 1.1.0 with hadoop 2.4 installed on a 3-node cluster 2) from the master node, I did the following steps: spark-shell --

return probability \ confidence instead of actual class

2014-09-19 Thread Adamantios Corais
Hi, I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me; however, I would like to distinguish the instances that are classified with high confidence from those with low confidence. How do we define the threshold here? Ultimately, I want to keep only those for which
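One way to get at a confidence value (a sketch; after clearThreshold() the model returns the raw SVM margin instead of a 0/1 label, which you can threshold yourself; the cutoff below is hypothetical):

    import org.apache.spark.mllib.classification.SVMWithSGD

    // trainingData / testData: RDD[LabeledPoint], assumed to exist
    val model = SVMWithSGD.train(trainingData, 100)
    model.clearThreshold() // predict() now returns the raw margin, not a class

    val scored = testData.map(p => (model.predict(p.features), p.label))
    val confident = scored.filter { case (margin, _) => math.abs(margin) > 1.0 }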

Re: paging through an RDD that's too large to collect() all at once

2014-09-19 Thread Dave Anderson
Excellent - that's exactly what I needed. I saw iterator() but missed the toLocalIterator() method.
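For the archives, the call in question (a sketch; pageSize is illustrative):

    val pageSize = 100
    // streams partitions to the driver one at a time, so only one partition
    // ever needs to fit in driver memory
    rdd.toLocalIterator.grouped(pageSize).foreach { page =>
      page.foreach(println) // render one "page" of results
    }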

Re: Cannot run SimpleApp as regular Java app

2014-09-19 Thread ericacm
It turns out that it was the Hadoop version that was the issue. spark-1.0.2-hadoop1 and spark-1.1.0-hadoop1 both work. spark-1.0.2-hadoop2 and spark-1.1.0-hadoop2.4 do not work. It's strange because for this little test I am not even using HDFS at all. -- Eric On Thu,

Re: rsync problem

2014-09-19 Thread rapelly kartheek
Thank you Soumya Simanta and Tobias. I've deleted the contents of the work folder on all the nodes. Now it's working perfectly, as it was before. Thank you, Karthik On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta wrote: > One possible reason may be that the checkpointing directory > $SPARK_HOME

Re: Spark Streaming and ReactiveMongo

2014-09-19 Thread Soumitra Kumar
onStart should be non-blocking. You may try to create a thread in onStart instead. - Original Message - From: "t1ny" To: u...@spark.incubator.apache.org Sent: Friday, September 19, 2014 1:26:42 AM Subject: Re: Spark Streaming and ReactiveMongo Here's what we've tried so far as a first e
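The pattern looks roughly like this against the Spark 1.x Receiver API (a sketch; the receive loop body is illustrative):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MongoStreamReceiver(host: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

      def onStart(): Unit = {
        // do the blocking work on a separate thread so onStart returns immediately
        new Thread("mongo-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {} // the receive thread checks isStopped() and exits

      private def receive(): Unit = {
        while (!isStopped()) {
          store("record") // illustrative: read from Mongo and hand records to Spark
        }
      }
    }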

Re: Support R in Spark

2014-09-19 Thread oppokui
Thanks, Shivaram. Kui > On Sep 19, 2014, at 12:58 AM, Shivaram Venkataraman > wrote: > > As R is single-threaded, SparkR launches one R process per-executor on > the worker side. > > Thanks > Shivaram > > On Thu, Sep 18, 2014 at 7:49 AM, oppokui wrote: >> Shivaram, >> >> As I know, SparkR

Anyone have successful recipe for spark cassandra connector?

2014-09-19 Thread gzoller
I'm running out of options trying to integrate Cassandra, Spark, and the spark-cassandra-connector. I quickly found out that just grabbing the latest versions of everything (drivers, etc.) doesn't work -- binary incompatibilities, it would seem. So last I tried using versions of drivers from the spark-ca

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-19 Thread RJ Nowling
Jatin, If you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it. RJ On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng wrote: > Hi Jatin, > > HashingTF should be able to solve the memory problem if you use a > small feature dimension in HashingTF. Please do
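For reference, the 1.1.0 API under discussion (a sketch; docs is assumed to be an RDD of tokenized documents):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    // val docs: RDD[Seq[String]] -- tokenized documents, an assumption
    val tf = new HashingTF(1 << 18) // a smaller feature dimension means less memory
    val tfVectors = tf.transform(docs)
    tfVectors.cache() // IDF().fit makes one pass, transform makes another
    val tfidf = new IDF().fit(tfVectors).transform(tfVectors)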

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-19 Thread Frank Austin Nothaft
Hi Mohan, It’s a bit convoluted to follow in their source, but they essentially typedef KSerializer as being a KryoSerializer, and then their serializers all extend KSerializer. Spark should identify them properly as Kryo Serializers, but I haven’t tried it myself. Regards, Frank Austin Notha
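If it helps, registration would look something like the following (a sketch: the KryoRegistrator plumbing is standard Spark, but the chill-avro factory name and MyAvroRecord are assumptions to verify against the chill-avro README):

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.avro.AvroSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // chill-avro serializers extend Kryo's Serializer, so plain registration works;
        // SpecificRecordBinarySerializer is an assumed factory name
        kryo.register(classOf[MyAvroRecord],
          AvroSerializer.SpecificRecordBinarySerializer[MyAvroRecord])
      }
    }
    // then: conf.set("spark.kryo.registrator", "com.example.MyRegistrator")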

Re: Bulk-load to HBase

2014-09-19 Thread Aniket Bhatnagar
Agreed that the bulk import would be faster. In my case, I wasn't expecting a lot of data to be uploaded to HBase, and also I didn't want to take the pain of importing the generated HFiles into HBase. Is there a way to invoke the HBase HFile import batch script programmatically? On 19 September 2014 17:58

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-19 Thread Aniket Bhatnagar
Apologies in delay in getting back on this. It seems the Kinesis example does not run on Spark 1.1.0 even when it is built using kinesis-acl profile because of a dependency conflict in http client (same issue as http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5j

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
In fact, it seems that Put can be used by HFileOutputFormat, so the Put object itself may not be the problem. The problem is that TableOutputFormat uses the Put object in the normal way (it goes through the normal write path), while HFileOutputFormat uses it to directly build the HFile. From: innowi

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Thank you for the example code. Currently I use foreachPartition() + Put(), but your example code can be used to clean up my code. BTW, since the data uploaded by Put() goes through the normal HBase write path, it can be slow. So, it would be nice if bulk-load could be used, since it bypasse

Re: Bulk-load to HBase

2014-09-19 Thread Aniket Bhatnagar
I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat instead of HFileOutputFormat. But, hopefully this should help you: val hbaseZookeeperQuorum = s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath" val conf = HBaseConfiguration.create() conf.set("hbase.zookeeper.quorum", hbase
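The rest of that snippet presumably continues along these lines (a sketch; the table, column family and the RDD's element type are illustrative):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = new Job(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // rdd: RDD[(String, String)], an assumption
    rdd.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }.saveAsNewAPIHadoopDataset(job.getConfiguration)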

Re: Spark + Mahout

2014-09-19 Thread Matthew Farrellee
On 09/19/2014 05:06 AM, Sean Owen wrote: No, it is actually a quite different 'alpha' project under the same name: linear algebra DSL on top of H2O and also Spark. It is not really about algorithm implementations now. On Sep 19, 2014 1:25 AM, "Matthew Farrellee" wrote:

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Hi, Sorry, I just found saveAsNewAPIHadoopDataset. Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there any example code for that? Thanks. From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] Sent: Friday, September 19, 2014 8:18 PM To: user@spark

RE: Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Hi, After reading several documents, it seems that saveAsHadoopDataset cannot use HFileOutputFormat. That's because the saveAsHadoopDataset method uses JobConf, so it belongs to the old Hadoop API, while HFileOutputFormat is a member of the mapreduce package, which is for the new Hadoop API. Am I rig

Re: rsync problem

2014-09-19 Thread Soumya Simanta
One possible reason may be that the checkpointing directory $SPARK_HOME/work is rsynced as well. Try emptying the contents of the work folder on each node and try again. On Fri, Sep 19, 2014 at 4:53 AM, rapelly kartheek wrote: > I followed this command: rsync -avL --progress path/to/spark

Bulk-load to HBase

2014-09-19 Thread innowireless TaeYun Kim
Hi, Is there a way to bulk-load to HBase from an RDD? HBase offers the HFileOutputFormat class for bulk loading by a MapReduce job, but I cannot figure out how to use it with saveAsHadoopDataset. Thanks.

Re: PairRDD's lookup method Performance

2014-09-19 Thread Sean Owen
The product of each mapPartitions call can be an Iterable of one big Map. You still need to write some extra custom code like what lookup() does to exploit this data structure. On Sep 18, 2014 11:07 PM, "Harsha HN" <99harsha.h@gmail.com> wrote: > Hi All, > > My question is related to improving
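A sketch of what that looks like: build one Map per partition, then run a job against only the partition the key hashes to, the way lookup() does internally (Spark 1.x runJob signature; key/value types are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    // rdd: RDD[(String, String)], an assumption
    val partitioned = rdd.partitionBy(new HashPartitioner(16)).cache()
    val maps = partitioned
      .mapPartitions(iter => Iterator(iter.toMap), preservesPartitioning = true)
      .cache()

    def fastLookup(key: String): Option[String] = {
      val p = partitioned.partitioner.get.getPartition(key) // only this partition is scanned
      sc.runJob(maps, (it: Iterator[Map[String, String]]) => it.next().get(key),
        Seq(p), allowLocal = false).head
    }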

Re: Spark + Mahout

2014-09-19 Thread Sean Owen
No, it is actually a quite different 'alpha' project under the same name: linear algebra DSL on top of H2O and also Spark. It is not really about algorithm implementations now. On Sep 19, 2014 1:25 AM, "Matthew Farrellee" wrote: > On 09/18/2014 05:40 PM, Sean Owen wrote: > >> No, the architecture

Re: rsync problem

2014-09-19 Thread rapelly kartheek
I followed this command: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory. Anyway, for now, I did it individually for each node. I have copied to each node individually using the above command. So, I guess the copying may not contain any

Re: rsync problem

2014-09-19 Thread Tobias Pfeiffer
Hi, On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek wrote: > > you have copied a lot of files from various hosts to > > username@slave3:path > only from one node to all the other nodes... I don't think rsync can do that in one command as you described. My guess is that now you have a

Fwd: rsync problem

2014-09-19 Thread rapelly kartheek
-- Forwarded message -- From: rapelly kartheek Date: Fri, Sep 19, 2014 at 1:51 PM Subject: Re: rsync problem To: Tobias Pfeiffer Any idea why the cluster is dying down? On Fri, Sep 19, 2014 at 1:47 PM, rapelly kartheek wrote: > > > you have copied a lot of files from

Re: Spark Streaming and ReactiveMongo

2014-09-19 Thread t1ny
Here's what we've tried so far as a first example of a custom Mongo receiver: class MongoStreamReceiver(host: String) extends NetworkReceiver[String] { protected lazy val blocksGenerator: BlockGenerator = new BlockGenerator(StorageLevel.MEMORY_AND_DISK_SER_2) protected def onStart()

Re: rsync problem

2014-09-19 Thread rapelly kartheek
> you have copied a lot of files from various hosts to username@slave3:path only from one node to all the other nodes... On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek wrote: > Hi Tobias, > > I've copied the files from master to all the slaves. > > On Fri, Sep 19, 2014 at 1:37 PM, Tobias

Re: rsync problem

2014-09-19 Thread rapelly kartheek
Hi Tobias, I've copied the files from master to all the slaves. On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer wrote: > Hi, > > On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek > wrote: >> >> This worked perfectly. But, I wanted to simultaneously rsync all the >> slaves. So, added the other

Re: rsync problem

2014-09-19 Thread Tobias Pfeiffer
Hi, On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek wrote: > > This worked perfectly. But I wanted to simultaneously rsync all the > slaves, so I added the other slaves as follows: > > rsync -avL --progress path/to/spark-1.0.0 > username@destinationhostname:path/to/destdirectory username@sl

rsync problem

2014-09-19 Thread rapelly kartheek
Hi, I'd made some modifications to the Spark source code on the master and propagated them to the slaves using rsync. I followed this command: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory. This worked perfectly. But, I wanted to simultaneously rs

Re: Unable to find proto buffer class error with RDD

2014-09-19 Thread Paul Wais
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for Function serde :( On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote: > Well it looks like this is indeed a protobuf issue. Poked a little more > with Kryo. Since protobuf messages are serializable, I tried just making > Kryo

Re: Unable to find proto buffer class error with RDD

2014-09-19 Thread Paul Wais
Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just making Kryo use the JavaSerializer for my messages. The resulting stack trace made it look like protobuf GeneratedMessageLite is actually using the classloade

Re: Odd error when using a rdd map within a stream map

2014-09-19 Thread Filip Andrei
Hey, I don't think that's the issue; foreach is called on 'results', which is a DStream of floats, so naturally it passes RDDs to its function. And either way, changing the code in the first mapper to comment out the map-reduce process on the RDD: Float f = 1.0f; //nnRdd.map(new Function() {

basic streaming question

2014-09-19 Thread motte1988
Hello everybody, I'm new to Spark Streaming and have played around a bit with WordCount and a PageRank algorithm in a cluster environment. Am I right that in the cluster each executor computes its part of the data stream separately, and that the result of each executor is independent of the other executors? In the

Powered By Spark

2014-09-19 Thread Alexander Albul
Hello! Could you please add us to your "powered by" page? Project name: Ubix.io Link: http://ubix.io Components: Spark, Shark, Spark SQL, MLlib, GraphX, Spark Streaming, Adam project Description:

Time difference between Python and Scala

2014-09-19 Thread Luis Guerra
Hello everyone, What should be the normal time difference between Scala and Python using Spark? I mean running the same program in the same cluster environment. In my case I am using numpy array structures for the Python code and vectors for the Scala code, both for handling my data. The time dif

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-19 Thread mohan.gadm
Thanks for the info, Frank. Twitter chill's avro serializer looks great. But how does Spark identify it as a serializer, since it's not extending KryoSerializer? (Sorry, Scala is an alien language for me.) - Thanks & Regards, Mohan