Re: Spark Streaming not processing file with particular number of entries

2014-06-06 Thread praveshjain1991
Hi, I am using Spark-1.0.0 over a 3-node cluster with 1 master and 2 slaves. I am trying to run the LR algorithm over Spark Streaming. package org.apache.spark.examples.streaming; import java.io.BufferedReader; import java.io.BufferedWriter; import java.io.FileWriter; import

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-06 Thread Mayur Rustagi
You can look at creating a DStream directly from the S3 location using fileStream. If you want to use any specific logic, you can rely on queueStream: read the data yourself from S3, process it and push it into an RDD queue. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
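
A rough sketch of the queueStream approach described above, assuming the Spark 1.0 Streaming API; fetchFromS3 is a hypothetical stand-in for your own S3-reading logic:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("S3QueueStream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Queue of RDDs that backs the DStream
    val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
    val stream = ssc.queueStream(rddQueue)
    stream.print()
    ssc.start()

    // Hypothetical helper: fetch and pre-process an S3 object yourself.
    def fetchFromS3(): Seq[String] = Seq("record1", "record2")

    // Elsewhere (e.g. a separate thread): push processed records into the queue.
    rddQueue += ssc.sparkContext.parallelize(fetchFromS3())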

Re: Native library can not be loaded when using Mllib PCA

2014-06-06 Thread yangliuyu
Thanks Xiangrui, I switched to a Ubuntu 14.04 server and it works after installing liblapack3gf and libopenblas-base. So it is an environment problem which is not related to MLlib. -- View this message in context:

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Oleg Proudnikov
Thank you, Andrew! On 5 June 2014 23:14, Andrew Ash and...@andrewash.com wrote: Oh, my apologies, that was for 1.0. For Spark 0.9 I did it like this: MASTER=spark://mymaster:7077 SPARK_MEM=8g ./bin/spark-shell -c $CORES_ACROSS_CLUSTER The downside of this, though, is that SPARK_MEM also sets

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Oleg Proudnikov
Thank you, Hassan! On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote: just use -Dspark.executor.memory= -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-executor-memory-when-using-spark-shell-tp7082p7103.html Sent from the Apache Spark

Re: Serialization problem in Spark

2014-06-06 Thread Mayur Rustagi
Where are you getting the serialization error? It's likely to be a different problem. Which class is not getting serialized? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jun 5, 2014 at 6:32 PM, Vibhor Banga

Re: creating new ami image for spark ec2 commands

2014-06-06 Thread Akhil Das
You can comment out this function and create a new one that returns your AMI ID, and the rest of the script will run fine. def get_spark_ami(opts): instance_types = { "m1.small": "pvm", "m1.medium": "pvm", "m1.large": "pvm", "m1.xlarge": "pvm", "t1.micro": "pvm", "c1.medium":

Re: Using mongo with PySpark

2014-06-06 Thread Mayur Rustagi
Yes, initialization each turn is hard... you seem to be using Python. Another risky thing you can try is to serialize the mongoclient object using any serializer (like kryo wrappers in Python) and pass it on to the mappers... then in each mapper you'll just have to deserialize it and use it directly... This may

Re: How to shut down Spark Streaming with Kafka properly?

2014-06-06 Thread Sean Owen
I closed that PR for other reasons. This change is still proposed by itself: https://issues.apache.org/jira/browse/SPARK-2034 https://github.com/apache/spark/pull/980 On Fri, Jun 6, 2014 at 1:35 AM, Tobias Pfeiffer t...@preferred.jp wrote: Sean, your patch fixes the issue, thank you so much!

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-06 Thread Denes
I have the same problem (Spark 0.9.1 -> 1.0.0 throws the error) and I do call saveAsTextFile. Recompiled for 1.0.0. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:10 failed 4 times, most recent failure: Exception failure in TID 1616 on host

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-06 Thread HenriV
I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is returning errors while saving in the cloud or saving locally. When I start a job in the cluster I do get an error, but after this error it keeps on running fine

RE: range partitioner with updateStateByKey

2014-06-06 Thread RodrigoB
Hi TD, I have the same question: I need the workers to process in arrival order, since it's updating a state based on the previous one. Thanks in advance. Rod -- View this message in context:

Re: Join : Giving incorrect result

2014-06-06 Thread Ajay Srivastava
Thanks Matei. We have tested the fix and it's working perfectly. Andrew, we set spark.shuffle.spill=false but the application goes out of memory. I think that is expected. Regards, Ajay On Friday, June 6, 2014 3:49 AM, Andrew Ash and...@andrewash.com wrote: Hi Ajay, Can you please try

Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-06 Thread bijoy deb
Hi, I am trying to build Shark-0.9.1 from source, with Spark-1.0.0 as its dependency, using the sbt package command. But I am getting the below error during the build, which makes me think that perhaps Spark-1.0.0 is not compatible with Shark-0.9.1: [info] Compilation completed in 9.046 s

Re: creating new ami image for spark ec2 commands

2014-06-06 Thread Matt Work Coarr
Thanks for the response Akhil. My email may not have been clear, but my question is about what should be inside the AMI image, not how to pass an AMI ID into the spark_ec2 script. Should certain packages be installed? Do certain directories need to exist? etc... On Fri, Jun 6, 2014 at 4:40

Re: spark 1.0 not using properties file from SPARK_CONF_DIR

2014-06-06 Thread Eugen Cepoi
Here is the PR https://github.com/apache/spark/pull/997 2014-06-03 19:24 GMT+02:00 Patrick Wendell pwend...@gmail.com: You can set an arbitrary properties file by adding --properties-file argument to spark-submit. It would be nice to have spark-submit also look in SPARK_CONF_DIR as well by

Re: creating new ami image for spark ec2 commands

2014-06-06 Thread Akhil Das
Hi Matt, You will need the following on the AMI: 1. Java installed 2. Root login enabled 3. /mnt available (since all the storage goes there) The rest of the things the spark-ec2 script will set up for you. Let me know if you need any more clarification on this. Thanks Best Regards

Re: Why Scala?

2014-06-06 Thread Nicholas Chammas
To add another note on the benefits of using Scala to build Spark, here is a very interesting and well-written post http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html on the Databricks blog about how Scala 2.10's runtime reflection enables

Using Java functions in Spark

2014-06-06 Thread Oleg Proudnikov
Hi All, I am passing Java static methods into the RDD transformations map and mapValues. The first map is from a simple string K into a (K, V) pair where V is a Java ArrayList of large text strings, 50K each, read from Cassandra. mapValues then processes these text blocks into very small ArrayLists.
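
For readers skimming the thread, a sketch of the pipeline shape being described, written in Scala rather than the poster's Java; the Cassandra-reading and summarizing helpers are hypothetical stubs:

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.0

    // Hypothetical stand-ins for the poster's static methods.
    def readBlocksFromCassandra(k: String): Seq[String] = Seq("large text block")
    def summarize(block: String): String = block.take(10)

    val keys = sc.parallelize(Seq("key1", "key2"))
    // map: K => (K, V), where V is the list of large blocks for that key
    val withBlocks = keys.map(k => (k, readBlocksFromCassandra(k)))
    // mapValues: shrink each list of large blocks to a small summary list
    val summaries = withBlocks.mapValues(blocks => blocks.map(summarize))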

Bayes Net with Graphx?

2014-06-06 Thread Greg
Hi, I want to create a (very large) Bayes net using GraphX. To do so, I need to be able to associate conditional probability tables with each node of the graph. Is there any way to do this? All of the examples I've seen just have basic nodes and edges, no associated information. thanks, Greg

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-06 Thread Gianluca Privitera
Where are the APIs for QueueStream and RDDQueue? In my solution I cannot open a DStream with the S3 location, because I need to run a script on the file (a script that unfortunately doesn't accept stdin as input), so I have to download it to my disk somehow, then handle it from there before creating the

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Patrick Wendell
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell. On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Thank you, Hassan! On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote: just use -Dspark.executor.memory= -- View this
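For example, with a 1.0 build (master URL and memory size assumed): MASTER=spark://mymaster:7077 ./bin/spark-shell --executor-memory 2g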

Re: Invalid Class Exception

2014-06-06 Thread Jenny Zhao
We experienced a similar issue in our environment; below is the whole stack trace. It works fine if we run in local mode, but if we run in cluster mode (even with the Master and 1 worker on the same node), we get this serialVersionUID issue. We use Spark 1.0.0, compiled with JDK 6. Here is a link about

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Oleg Proudnikov
Thank you, Patrick. I am planning to switch to 1.0 now. By way of feedback - I used Andrew's suggestion and found that it does exactly that - sets the Executor JVM heap - and nothing else. The Workers have already been started, and when the shell starts, it is now able to control the Executor JVM heap.

Re: creating new ami image for spark ec2 commands

2014-06-06 Thread Matt Work Coarr
Thanks Akhil! I'll give that a try!

Re: Using Java functions in Spark

2014-06-06 Thread Oleg Proudnikov
Additional observation - the map and mapValues are pipelined and executed - as expected - in pairs. This means that there is a simple sequence of steps - first read from Cassandra and then processing for each value of K. This is the exact behaviour of a normal Java loop with these two steps

Re: Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-06 Thread Michael Armbrust
There is not an official updated version of Shark for Spark 1.0 (though you might check out the untested spark-1.0 branch on GitHub). You can also check out the preview release of Shark that runs on Spark SQL: https://github.com/amplab/shark/tree/sparkSql Michael On Fri, Jun 6, 2014 at

Spark 1.0 embedded Hive libraries

2014-06-06 Thread Silvio Fiorito
Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore) used in Spark SQL? Are they forked with Spark-specific customizations, like Shark, or simply relabeled with a new package name (org.spark-project.hive)? I couldn't find any repos on GitHub

Re: error loading large files in PySpark 0.9.0

2014-06-06 Thread Jeremy Freeman
Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat support. We recently wrote a custom hadoop input format so we can support flat binary files (https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop), and have been testing it in Scala.

Re: Bayes Net with Graphx?

2014-06-06 Thread Ankur Dave
Vertices can have arbitrary properties associated with them: http://spark.apache.org/docs/latest/graphx-programming-guide.html#the-property-graph Ankur http://www.ankurdave.com/
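
A minimal sketch of attaching a table to each vertex, assuming Spark 1.0 GraphX in the shell; the Cpt class is a hypothetical stand-in for a real conditional probability table:

    import org.apache.spark.graphx._

    // Hypothetical CPT: maps an assignment of parent values to a probability.
    case class Cpt(table: Map[Seq[Boolean], Double])

    val vertices = sc.parallelize(Seq(
      (1L, Cpt(Map(Seq.empty[Boolean] -> 0.5))),            // root node, no parents
      (2L, Cpt(Map(Seq(true) -> 0.9, Seq(false) -> 0.1)))   // child conditioned on node 1
    ))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, ())))       // edge carries no attribute
    val bayesNet: Graph[Cpt, Unit] = Graph(vertices, edges)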

Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Patrick Wendell
They are forked and slightly modified, for two reasons: (a) Hive embeds a bunch of other dependencies in its published jars, which makes it really hard for other projects to depend on them. If you look at the hive-exec jar, they copy a bunch of other dependencies directly into this jar. We

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-06 Thread Ryan Compton
Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to

Spark Streaming window functions bug 1.0.0

2014-06-06 Thread Gianluca Privitera
Is anyone experiencing problems with window functions? dstream1.print() val dstream2 = dstream1.groupByKeyAndWindow(Seconds(60)) dstream2.print() In my application the first print() prints out all the strings and their keys, but after the window function everything is lost and nothing gets printed.

Re: Using Spark on Data size larger than Memory size

2014-06-06 Thread Roger Hoover
Andrew, thank you. I'm using mapPartitions(), but as you say, it requires that every partition fit in memory. This will work for now but may not always, so I was wondering about another way. Thanks, Roger On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash and...@andrewash.com wrote: Hi Roger,
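
One common workaround (an assumption on the editor's part, not an answer from the thread) is to raise the partition count so each partition is small enough to process in memory:

    // Path and partition count are illustrative; pick a count so that each
    // partition comfortably fits in a single task's memory.
    val rdd = sc.textFile("hdfs:///big/input")
    val result = rdd.repartition(1000).mapPartitions { iter =>
      iter.map(_.length)   // stand-in for the real per-partition work
    }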

Showing key cluster stats in the Web UI

2014-06-06 Thread Nick Chammas
Someone correct me if this is wrong, but I believe 2 very important things to know about your cluster are: 1. How many cores your cluster has available. 2. How much memory your cluster has available. (Perhaps this could be divided into total/in-use/free or something.) Is

best practice: write and debug Spark application in scala-ide and maven

2014-06-06 Thread Wei Tan
Hi, I am trying to write and debug Spark applications in scala-ide and maven, and in my code I target a Spark instance at spark://xxx. object App { def main(args: Array[String]) { println("Hello World!") val sparkConf = new

Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Silvio Fiorito
Great, thanks for the info and pointer to the repo! From: Patrick Wendell pwend...@gmail.com Sent: Friday, June 6, 2014 5:11 PM To: user@spark.apache.org They are forked and slightly modified for two reasons: (a) Hive embeds a bunch of other

stage kill link is awfully close to the stage name

2014-06-06 Thread Nick Chammas
Minor point, but does anyone else find the new (and super helpful!) kill link awfully close to the stage detail link in the 1.0.0 Web UI? I think it would be better to have the kill link flush right, leaving a large amount of space between it and the stage detail link. Nick -- View this message

Re: stage kill link is awfully close to the stage name

2014-06-06 Thread Mikhail Strebkov
Nick Chammas wrote: I think it would be better to have the kill link flush right, leaving a large amount of space between it and the stage detail link. I think even better would be to have a pop-up confirmation: "Do you really want to kill this stage?" :) -- View this message in context:

unit test

2014-06-06 Thread b0c1
Hi! I have two questions: 1. I want to test my application. My app will write the result to Elasticsearch (stage 1) with saveAsHadoopFile. How can I write a test case for it? Do I need to pull up a MiniDFSCluster? Or is there another way? My application flow plan: Kafka => Spark Streaming (enrich) -
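
One lighter-weight alternative to a MiniDFSCluster (a sketch under assumptions, not an answer from the thread) is a local-mode SparkContext writing to a temp directory on the local filesystem:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "write-test")
    // JDK6-safe temp path; unique per run.
    val out = System.getProperty("java.io.tmpdir") + "/es-stage1-" + System.currentTimeMillis
    // saveAsTextFile stands in here for saveAsHadoopFile with an
    // Elasticsearch output format, which would need a live cluster.
    sc.parallelize(Seq("a", "b")).saveAsTextFile(out)
    assert(sc.textFile(out).count() == 2)
    sc.stop()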

Re: NoSuchElementException: key not found

2014-06-06 Thread RodrigoB
Hi Tathagata, I'm seeing the same issue on a load run overnight with Kafka streaming (6000 msgs/sec) and a 500 ms batch size. Again occasional, and only happening after a few hours. I believe I'm using updateStateByKey. Any comment will be very welcome. Thanks, Rod -- View this message in

cache spark sql parquet file in memory?

2014-06-06 Thread Xu (Simon) Chen
This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, thereby making Spark SQL queries faster. Thanks. -Simon
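
In Spark 1.0 you can register the Parquet file as a table and cache it in SQLContext's in-memory columnar store; a sketch, with the path and table name assumed:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val parquet = sqlContext.parquetFile("hdfs:///tmp/intermediate.parquet")
    parquet.registerAsTable("intermediate")
    sqlContext.cacheTable("intermediate")   // subsequent queries read from memory
    sqlContext.sql("SELECT COUNT(*) FROM intermediate").collect()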

Re: New user streaming question

2014-06-06 Thread Jeremy Lee
Yup, when it's running, DStream.print() will print out a timestamped block for every time step, even if the block is empty. (for v1.0.0, which I have running in the other window) If you're not getting that, I'd guess the stream hasn't started up properly. On Sat, Jun 7, 2014 at 11:50 AM,

Best practise for 'Streaming' dumps?

2014-06-06 Thread Jeremy Lee
It's going well enough that this is a "how should I in 1.0.0" rather than a "how do I" question. So I've got data coming in via Streaming (tweets) and I want to archive/log it all. It seems a bit wasteful to generate a new HDFS file for each DStream, but I also want to guard against data loss from
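
For reference, Spark Streaming's built-in archival hook; a minimal sketch assuming a DStream[String] named tweets (the prefix and suffix are assumptions):

    import org.apache.spark.streaming.Minutes

    // Writes one directory of files per batch interval, named
    // prefix-TIME_IN_MS.suffix; wasteful for small batches, as noted above.
    tweets.saveAsTextFiles("hdfs:///archive/tweets", "txt")

    // One mitigation: window the stream so files are written less often.
    tweets.window(Minutes(10), Minutes(10)).saveAsTextFiles("hdfs:///archive/tweets", "txt")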

Re: stage kill link is awfully close to the stage name

2014-06-06 Thread Mayur Rustagi
And then an "are you sure?" after that :) On 7 Jun 2014 06:59, Mikhail Strebkov streb...@gmail.com wrote: Nick Chammas wrote: I think it would be better to have the kill link flush right, leaving a large amount of space between it and the stage detail link. I think even better would be to have a