python worker crash in spark 1.0

2014-06-19 Thread Schein, Sagi
Hi, I am trying to upgrade from Spark v0.9.1 to v1.0.0 and am running into some weird behavior. When, in pyspark, I invoke sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1").take(1), the call crashes with the below stack trace. The file resides in Hadoop 2.2; it is a large event dataset,

Re: Contribution to Spark MLLib

2014-06-19 Thread Jayati
Hello Xiangrui, I am looking at the Spark Issues, but just wanted to know if it is mandatory for me to work on existing JIRAs before I can contribute to MLLib. Regards, Jayati

Re: Contribution to Spark MLLib

2014-06-19 Thread Xiangrui Meng
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar with the modification you mentioned since the paper is new. We may need to spend more time to learn the trade-offs. Feel free to create a JIRA for PLSA and we can move our discussion there. It would be great if you can share

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
Yes, I need to call the external service for every event and the order does not matter. There's no time limit in which each event should be processed. I can't tell the producer to slow down, nor drop events. Of course I could put a message broker in between, like an AMQP or JMS broker, but I was

MLLib inside Storm : silly or not ?

2014-06-19 Thread Eustache DIEMERT
Hi Sparkers, We have a Storm cluster and are looking for a decent execution engine for machine-learned models. What I've seen from MLLib is extremely positive, but we can't just throw away our Storm-based stack. So my question is: is it feasible/recommended to train models in Spark/MLLib and execute

Re: Spark streaming and rate limit

2014-06-19 Thread Michael Cutler
Hello Flavio, It sounds to me like the best solution for you is to implement your own ReceiverInputDStream/Receiver component to feed Spark Streaming with DStreams. It is not as scary as it sounds, take a look at some of the examples like TwitterInputDStream
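The Receiver pattern Michael describes — a component running on its own thread that pulls from an external source and buffers records until the engine drains them — can be sketched in plain Python. This is only an illustration of the pattern; Spark 1.0's actual Receiver API is Scala/Java, and the class and method names below (PollingReceiver, drain) are hypothetical stand-ins for Receiver and store().

```python
import queue
import threading

class PollingReceiver:
    """Toy stand-in for a Spark Streaming Receiver: a background
    thread pulls batches of records from an external source and
    buffers them until the engine drains them (Spark's store())."""

    def __init__(self, fetch):
        self.fetch = fetch          # callable returning a batch of records, or None when done
        self.buffer = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            batch = self.fetch()
            if batch is None:       # source exhausted: stop receiving
                break
            for record in batch:
                self.buffer.put(record)

    def start(self):
        self.thread.start()

    def drain(self):
        """Return everything buffered so far, in arrival order."""
        out = []
        while not self.buffer.empty():
            out.append(self.buffer.get())
        return out

# Hypothetical source that yields two batches, then signals completion.
batches = iter([["a", "b"], ["c"], None])
r = PollingReceiver(lambda: next(batches))
r.start()
r.thread.join()
print(r.drain())  # ['a', 'b', 'c']
```

Because the fetch loop owns its own thread, it can rate-limit or pool connections however it likes without blocking the consumer.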

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
Hi Michael, thanks for the tip, it's really an elegant solution. What I'm still missing here (maybe I should take a look at the code of TwitterInputDStream https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala..) is

Re: Spark streaming and rate limit

2014-06-19 Thread Michael Cutler
Hi Flavio, When your streaming job starts somewhere in the cluster the Receiver will be started in its own thread/process. You can do whatever you like within the receiver e.g. start and manage your own thread pool to fetch external data and feed Spark. If your Receiver dies Spark will

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Makoto Yui
Xiangrui and Debasish, (2014/06/18 6:33), Debasish Das wrote: I did run a pretty big sparse dataset (20M rows, 3M sparse features) and I got 100 iterations of SGD running in 200 seconds...10 executors each with 16 GB memory... I figured out what the problem is: spark.akka.frameSize was too

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
Ok, I'll start from that when I try to implement it. Thanks again for the great support! Best, Flavio On Thu, Jun 19, 2014 at 10:57 AM, Michael Cutler mich...@tumra.com wrote: Hi Flavio, When your streaming job starts somewhere in the cluster the Receiver will be started in its

write event logs with YARN

2014-06-19 Thread Christophe Préaud
Hi, I am trying to use the new Spark history server in 1.0.0 to view finished applications (launched on YARN), without success so far. Here are the relevant configuration properties in my spark-defaults.conf: spark.yarn.historyServer.address=server_name:18080 spark.ui.killEnabled=false

DStream join with a RDD (v1.0.0)

2014-06-19 Thread haopu
I have a JavaPairDStream whose RDD looks like (hello, world). I want to join it with a JavaPairRDD which has one item, (hello, spark). I expect the joined result to be something like (hello, (world, spark)). However, I see the result to be (hello, (world, world)). Is it a bug? Any suggestions
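For reference, the expected semantics of a key-value join can be sketched in plain Python (an illustration of the contract, not Spark's implementation): joining (hello, world) against (hello, spark) should yield one pair per matching value combination, so a result of (hello, (world, world)) suggests the stream ended up joined with itself rather than with the static RDD.

```python
from collections import defaultdict

def pair_join(left, right):
    """Join two lists of (key, value) pairs on key, like
    JavaPairRDD.join: one output pair per matching value pair."""
    by_key = defaultdict(list)
    for k, w in right:
        by_key[k].append(w)
    return [(k, (v, w)) for k, v in left for w in by_key[k]]

stream_batch = [("hello", "world")]   # one micro-batch of the DStream
static_rdd = [("hello", "spark")]     # the RDD being joined in
print(pair_join(stream_batch, static_rdd))  # [('hello', ('world', 'spark'))]
```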

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Charles Earl
While I can't definitively speak to MLLib online learning, I'm sure you're evaluating Vowpal Wabbit, for which there have been some Storm integrations contributed. Also you might look at factorie, http://factorie.cs.umass.edu, which at least provides an online LDA. C On Thursday, June 19,

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Eustache DIEMERT
Well, yes VW is an appealing option but I only found experimental integrations so far. Also, early experiments suggest Decision Trees Ensembles (RF, GBT) perform better than generalized linear models on our data. Hence the interest for MLLib :) Any other comments / suggestions welcome :) E/

Query on Merge Message (Graph: pregel operator)

2014-06-19 Thread Ghousia Taj
Hi, Can someone please clarify a small query on the Graph.pregel operator. As per the documentation on the merge message function, only two inbound messages can be merged to a single value. Is that actually the case, and if so, how can one merge n inbound messages? Any help is truly appreciated. Many Thanks,

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Surendranauth Hiraman
I can't speak for MLlib either. But I can say the model of training in Hadoop M/R or Spark and production scoring in Storm works very well. My team has done online learning (Sofia-ML library, I think) in Storm as well. I would be interested in this answer as well. -Suren On Thu, Jun 19, 2014 at

Getting started : Spark on YARN issue

2014-06-19 Thread Praveen Seluka
I am trying to run Spark on YARN. I have a hadoop 2.2 cluster (YARN + HDFS) in EC2. Then, I compiled Spark using Maven with the 2.2 hadoop profiles. Now I am trying to run the example Spark job (in yarn-cluster mode) from my local machine. I have set up the HADOOP_CONF_DIR environment variable

Re: implicit ALS dataSet

2014-06-19 Thread Sean Owen
On Thu, Jun 19, 2014 at 3:03 PM, redocpot julien19890...@gmail.com wrote: We did some sanity check. For example, each user has his own item list which is sorted by preference, then we just pick the top 10 items for each user. As a result, we found that there were only 169 different items among

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example if i have an akka timeout setting that i would like to be applied to every piece of the spark framework (so spark master, spark workers, spark executor sub-processes, spark-shell, etc.). i used to do

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Xiangrui Meng
It is because the frame size is not set correctly in executor backend. See SPARK-1112. We are going to fix it in v1.0.1. Did you try the treeAggregate? On Jun 19, 2014, at 2:01 AM, Makoto Yui yuin...@gmail.com wrote: Xiangrui and Debasish, (2014/06/18 6:33), Debasish Das wrote: I did
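The reason treeAggregate helps with frame-size pressure is that partial results are combined pairwise in a tree rather than every partition sending its result straight to the driver. The shape of that combine step can be sketched in plain Python (an illustration of the idea, not the Spark API):

```python
def tree_reduce(values, combine):
    """Combine partial results pairwise, level by level, so no single
    step (and, in Spark's case, no single node) has to merge more
    than a couple of partials at once."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(combine(level[i], level[i + 1]))
        if len(level) % 2:          # odd element carries over to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

partials = [1, 2, 3, 4, 5]          # stand-ins for per-partition partial sums
print(tree_reduce(partials, lambda a, b: a + b))  # 15
```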

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-19 Thread ldmtwo
Here is a partial comparison. http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2 SciDB uses MPI with Intel HW and libraries. Amazing performance at the cost of more work. In case the link stops working: A Complex Analytics Genomics Benchmark Rebecca Taft-,

Getting different answers running same line of code

2014-06-19 Thread mrm
Hi, I have had this issue for some time already, where I get different answers when I run the same line of code twice. I have run some experiments to see what is happening, please help me! Here is the code and the answers that I get. I suspect I have a problem when reading large datasets from S3.

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-19 Thread Evan R. Sparks
Larry, I don't see any reference to Spark in particular there. Additionally, the benchmark only scales up to datasets that are roughly 10gb (though I realize they've picked some fairly computationally intensive tasks), and they don't present their results on more than 4 nodes. This can hide

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Makoto Yui
Xiangrui, (2014/06/19 23:43), Xiangrui Meng wrote: It is because the frame size is not set correctly in executor backend. See SPARK-1112. We are going to fix it in v1.0.1. Did you try the treeAggregate? Not yet. I will wait for the v1.0.1 release. Thanks, Makoto

Re: implicit ALS dataSet

2014-06-19 Thread Sean Owen
On Thu, Jun 19, 2014 at 3:44 PM, redocpot julien19890...@gmail.com wrote: As the paper said, the low ratings will get a low confidence weight, so if I understand correctly, these dominant one-timers will be more *unlikely* to be recommended compared to other items whose nbPurchase is bigger.
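The confidence weighting the paper (Hu, Koren, Volinsky, "Collaborative Filtering for Implicit Feedback Datasets") assigns to an implicit observation r is c = 1 + alpha * r, so a single purchase carries far less weight than repeated ones. A sketch of the arithmetic (alpha = 40 is the paper's suggested default; your tuning may differ):

```python
def confidence(r, alpha=40.0):
    """Confidence weight for an implicit rating r: c = 1 + alpha * r.
    Unobserved items (r = 0) still get baseline confidence 1, which is
    how the model treats missing data as weak negative evidence."""
    return 1.0 + alpha * r

print(confidence(0))    # 1.0  -> never purchased
print(confidence(1))    # 41.0 -> one-timer
print(confidence(10))   # 401.0 -> repeat purchaser, much stronger signal
```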

1.0.1 release plan

2014-06-19 Thread Mingyu Kim
Hi all, Is there any plan for 1.0.1 release? Mingyu

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-19 Thread Tathagata Das
That's quite odd. Yes, with checkpointing the lineage does not increase. Can you tell which stage of the processing of each batch is causing the increase in the processing time? Also, what is the batch interval, and the checkpoint interval? TD On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik

Possible approaches for adding extra metadata (Spark Streaming)

2014-06-19 Thread Shrikar archak
Hi All, I was curious to know which of the two approaches is better for doing analytics using spark streaming. Let's say we want to add some metadata to the stream being processed, like sentiment, tags etc., and then perform some analytics using this added metadata. 1) Is it ok to make a

Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Surendranauth Hiraman
I've created an issue for this but if anyone has any advice, please let me know. Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two remaining tasks (out of 320). Those tasks seem to be waiting on data from another task on another node. Eventually (about 2 hours later) they

Parallel LogisticRegression?

2014-06-19 Thread Kyle Ellrott
I'm working on a problem learning several different sets of responses against the same set of training features. Right now I've written the program to cycle through all of the different label sets, attach them to the training data, and run LogisticRegressionWithSGD on each of them, i.e. foreach

Re: trying to understand yarn-client mode

2014-06-19 Thread Kevin Markey
Yarn client is much like Spark client mode, except that the executors are running in Yarn containers managed by the Yarn resource manager on the cluster instead of as Spark workers managed by the Spark master. The driver executes as a local client in your local JVM. It

Re: Issue while trying to aggregate with a sliding window

2014-06-19 Thread Tathagata Das
zeroTime marks the time when the streaming job started, and the first batch of data is from zeroTime to zeroTime + slideDuration. The validity check of (time - zeroTime) being a multiple of slideDuration is to ensure that a given dstream generates RDDs at the right times. For example, say the
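The validity check TD describes can be sketched in plain Python (an illustration of the alignment rule, not Spark's internal code; the timestamps below are made-up milliseconds):

```python
def is_valid_time(t, zero_time, slide_duration):
    """A DStream produces an RDD at time t only when t is aligned to
    its slide interval: (t - zeroTime) must be an exact multiple of
    slideDuration, so every dstream fires on its own grid."""
    return t >= zero_time and (t - zero_time) % slide_duration == 0

zero = 1000    # hypothetical job start, in ms
slide = 2000   # 2-second slide interval

valid = [t for t in range(1000, 9001, 1000) if is_valid_time(t, zero, slide)]
print(valid)   # [1000, 3000, 5000, 7000, 9000]
```

A dstream with a larger slideDuration (say a windowed stream) simply fires on a coarser grid over the same zeroTime origin.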

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Matei Zaharia
You should be able to use many of the MLlib Model objects directly in Storm, if you save them out using Java serialization. The only one that won’t work is probably ALS, because it’s a distributed model. Otherwise, you will have to output them in your own format and write code for evaluating

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
We are submitting the spark job in our tomcat application using yarn-cluster mode with great success. As Kevin said, yarn-client mode runs the driver in your local JVM, and it will have really bad network overhead when one does a reduce action, which will pull all the results from the executors to your local

How do you run your spark app?

2014-06-19 Thread ldmtwo
I want to ask this, not because I can't read endless documentation and several tutorials, but because there seems to be many ways of doing things and I keep having issues. How do you run *your* spark app? I had it working when I was only using yarn+hadoop1 (Cloudera), then I had to get Spark and

Re: spark with docker: errors with akka, NAT?

2014-06-19 Thread Mohit Jaggi
Based on Jacob's suggestion, I started using --net=host which is a new option in latest version of docker. I also set SPARK_LOCAL_IP to the host's IP address and then AKKA does not use the hostname and I don't need the Spark driver's hostname to be resolvable. Thanks guys for your help! On Tue,

Re: How do you run your spark app?

2014-06-19 Thread Evan R. Sparks
I use SBT, create an assembly, and then add the assembly jars when I create my spark context. I run the main executor with something like java -cp ... MyDriver. That said - as of spark 1.0 the preferred way to run spark applications is via spark-submit -

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Shuo Xiang
If I'm understanding correctly, you want to use MLlib for offline training and then deploy the learned model to Storm? In this case I don't think there is any problem. However if you are looking for online model update/training, this can be complicated and I guess quite a few algorithms in mllib

Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread Anita Tailor
I have a similar case where I have an RDD[(List[Any], List[Long])] and want to save it as a Parquet file. My understanding is that only RDDs of case classes can be converted to SchemaRDD. So is there any way I can save this RDD as a Parquet file without using Avro? Thanks in advance Anita On 18 June

Re: trying to understand yarn-client mode

2014-06-19 Thread Marcelo Vanzin
Coincidentally, I just ran into the same exception. What's probably happening is that you're specifying some jar file in your job as an absolute local path (e.g. just /home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config has the default FS set to HDFS. So your driver does not know

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-19 Thread lannyripple
Gino, I can confirm that your solution of assembling with spark-streaming-kafka but excluding spark-core and spark-streaming has me working with basic spark-submit. As mentioned, you must specify the assembly jar in the SparkConf as well as to spark-submit. When I see the error you are now

Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
db tsai, if in yarn-cluster mode the driver runs inside yarn, how can you do an rdd.collect and bring the results back to your application? On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai dbt...@stanford.edu wrote: We are submitting the spark job in our tomcat application using yarn-cluster mode with

Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Patrick Wendell
I'll make a comment on the JIRA - thanks for reporting this, let's get to the bottom of it. On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I've created an issue for this but if anyone has any advice, please let me know. Basically, on about 10 GBs of

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-19 Thread Tathagata Das
Hello all, Apologies for the late response, this thread went below my radar. There are a number of things that can be done to improve the performance. Here are some of them off the top of my head, based on what you have mentioned. Most of them are mentioned in the streaming guide's performance

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
Currently, we save the result in HDFS, and read it back in our application. Since Client.run is a blocking call, it's easy to do it this way. We are now working on using akka to bring back the result to the app without going through HDFS, and we can use akka to track the log, and stack trace,

increasing concurrency of saveAsNewAPIHadoopFile?

2014-06-19 Thread Sandeep Parikh
I'm trying to write a JavaPairRDD to a downstream database using saveAsNewAPIHadoopFile with a custom OutputFormat and the process is pretty slow. Is there a way to boost the concurrency of the save process? For example, something like splitting the RDD into multiple smaller RDDs and using Java

Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
okay. since for us the main purpose is to retrieve (small) data i guess i will stick to yarn-client mode. thx On Thu, Jun 19, 2014 at 3:19 PM, DB Tsai dbt...@stanford.edu wrote: Currently, we save the result in HDFS, and read it back in our application. Since Client.run is a blocking call, it's

Submiting multiple jobs via different threads

2014-06-19 Thread Zhang Haoming
Hi, I'm searching for some information about running programs concurrently on Spark. I did a simple experiment on the Spark Master: I opened two terminals, then submitted two programs to the Master at the same time and watched the programs' status via the Spark Master Web UI. I found one program could

pyspark bug with unittest and scikit-learn

2014-06-19 Thread Brad Miller
Hi All, I am attempting to develop some unit tests for a program using pyspark and scikit-learn and I've come across some weird behavior. I receive the following warning during some tests python/pyspark/serializers.py:327: DeprecationWarning: integer argument expected, got float. Although it's

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Andrew Or
Hi Koert and Lukasz, The recommended way of not hard-coding configurations in your application is through conf/spark-defaults.conf as documented here: http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties. However, this is only applicable to spark-submit, so
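As a concrete illustration, a conf/spark-defaults.conf of the kind Andrew describes might look like the following. The property names are real Spark 1.0 configuration keys; the values (master URL, memory, frame size) are placeholders you would replace with your own.

```
# conf/spark-defaults.conf -- picked up by spark-submit; values here are examples only
spark.master             spark://master:7077
spark.executor.memory    2g
spark.akka.frameSize     64
spark.eventLog.enabled   true
```

Because only spark-submit reads this file, applications launched any other way (plain java -cp, embedded drivers) still need these set programmatically on the SparkConf.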

Re: Getting started : Spark on YARN issue

2014-06-19 Thread Andrew Or
Hi Praveen, Yes, the fact that it is trying to use a private IP from outside of the cluster is suspicious. My guess is that your HDFS is configured to use internal IPs rather than external IPs. This means even though the hadoop confs on your local machine only use external IPs, the

Re: Getting started : Spark on YARN issue

2014-06-19 Thread Andrew Or
(Also, an easier workaround is to simply submit the application from within your cluster, thus saving you all the manual labor of reconfiguring everything to use public hostnames. This may or may not be applicable to your use case.) 2014-06-19 14:04 GMT-07:00 Andrew Or and...@databricks.com:

Re: Processing audio/video/images

2014-06-19 Thread jamal sasha
So.. Here is my experimental code to get a feel of it:
    def read_file(filename):
        with open(filename) as f:
            lines = [line for line in f]
        return lines
    files = ["/somepath/.../test1.txt", "sompath/.../test2.txt"]
test1.txt contains the lines "foo bar" and "this is test1"; test2.txt contains "bar foo" and "this is text2"

Re: increasing concurrency of saveAsNewAPIHadoopFile?

2014-06-19 Thread Nicholas Chammas
The main thing that will affect the concurrency of any saveAs...() operations is a) the number of partitions of your RDD, and b) how many cores your cluster has. How big is the RDD in question? How many partitions does it have? On Thu, Jun 19, 2014 at 3:38 PM, Sandeep Parikh

Re: MLLib : Decision Tree with minimum points per node

2014-06-19 Thread Manish Amde
Hi Justin, I am glad to know that trees are working well for you. The trees will support minimum samples per node in a future release. Thanks for the feedback. -Manish On Fri, Jun 13, 2014 at 8:55 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I have been playing around with mllib's

Re: How do you run your spark app?

2014-06-19 Thread Michael Cutler
When you start seriously using Spark in production there are basically two things everyone eventually needs: 1. Scheduled Jobs - recurring hourly/daily/weekly jobs. 2. Always-On Jobs - that require monitoring, restarting etc. There are lots of ways to implement these requirements,

Re: How do you run your spark app?

2014-06-19 Thread Michael Cutler
P.S. Last but not least, we use sbt-assembly to build fat JARs and build dist-style TAR.GZ packages with launch scripts, JARs and everything needed to run a Job. These are automatically built from source by our Jenkins and stored in HDFS. Our Chronos/Marathon jobs fetch the latest release

Re: MLLib : Decision Tree with minimum points per node

2014-06-19 Thread Manish Amde
Hi Justin, I have created a JIRA ticket to keep track of your request. Thanks. https://issues.apache.org/jira/browse/SPARK-2207 -Manish On Thu, Jun 19, 2014 at 2:35 PM, Manish Amde manish...@gmail.com wrote: Hi Justin, I am glad to know that trees are working well for you. The trees will

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-19 Thread Skogberg, Fredrik
Hi TD, That's quite odd. Yes, with checkpointing the lineage does not increase. Can you tell which stage of the processing of each batch is causing the increase in the processing time? I haven't been able to determine exactly what stage is causing the increase in processing time. Any

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
for a jvm application it's not very appealing to me to use spark-submit. my application uses hadoop, so i should use hadoop jar, and my application uses spark, so it should use spark-submit? if i add a piece of code that uses some other system there will be yet another suggested way to launch

Re: pyspark-Failed to run first

2014-06-19 Thread Congrui Yi
I'm starting to develop ADMM for some models using pyspark (Spark version 1.0.0), so I constantly simulate data to test my code. I did the simulation in python but then ran into the same kind of problems as mentioned above. The same meaningless error messages show up when I tried methods like first,

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-19 Thread Shivani Rao
Hello Andrew, I wish I could share the code, but for proprietary reasons I can't. But I can give some idea of what I am trying to do. The job reads a file and processes each of its lines. I am not doing anything intense in the processLogs function import

Re: Query on Merge Message (Graph: pregel operator)

2014-06-19 Thread Ankur Dave
Many merge operations can be broken up to work incrementally. For example, if the merge operation is to sum *n* rank updates, then you can set mergeMsg = (a, b) => a + b and this function will be applied to all *n* updates in arbitrary order to yield a final sum. Addition, multiplication, min, max,
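Ankur's point — a binary mergeMsg handles n inbound messages by being applied pairwise, which is why it must be commutative and associative — can be sketched in plain Python (an illustration of the contract, not GraphX code):

```python
from functools import reduce

def merge_inbound(messages, merge_msg):
    """Fold n inbound messages for one vertex into a single value by
    applying the binary merge_msg repeatedly. Because the engine may
    apply it in any order, merge_msg must be commutative and
    associative (sum, min, max, product, ...)."""
    return reduce(merge_msg, messages)

updates = [3, 1, 4, 2]  # n messages arriving at one vertex
print(merge_inbound(updates, lambda a, b: a + b))  # 10
print(merge_inbound(updates, min))                 # 1
```

A non-associative merge (e.g. subtraction) would give order-dependent results, which is exactly what the pairwise contract rules out.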

Re: BSP realization on Spark

2014-06-19 Thread Ankur Dave
Master/worker assignment and barrier synchronization are handled by Spark, so Pregel will automatically coordinate the computation, communication, and synchronization needed to run for the specified number of supersteps (or until there are no messages sent in a particular superstep). Ankur

Re: SparkR Installation

2014-06-19 Thread Zongheng Yang
Hi Stuti, Yes, you do need to install R on all nodes. Furthermore, the rJava library is also required, which can be installed simply using install.packages("rJava") in the R shell. Some more installation instructions after that step can be found in the README here:

How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
I want to store a JavaRDD as a sequence file instead of a text file. But I don't see any Java API for that. Is there a way to do this? Please let me know. Thanks!

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-19 Thread abhiguruvayya
Once you have generated the final RDD, before submitting it to the reducer, try to repartition the RDD using either coalesce(partitions) or repartition() into a known number of partitions. 2. Rule of thumb to create the number of data partitions: (3 * num_executors * cores_per_executor).
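The rule of thumb cited in the thread is simple arithmetic; a sketch (the heuristic is from this thread, not an official Spark recommendation, and the factor of 3 is just the suggested starting point):

```python
def suggested_partitions(num_executors, cores_per_executor, factor=3):
    """Heuristic partition count: enough tasks that every core stays
    busy, with a few 'waves' of tasks to smooth out stragglers and skew."""
    return factor * num_executors * cores_per_executor

# e.g. 10 executors with 4 cores each -> aim for ~120 partitions
print(suggested_partitions(num_executors=10, cores_per_executor=4))  # 120
```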

Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread contractor
Unfortunately, I couldn’t figure it out without involving Avro. Here is something that may be useful since it uses Avro generic records (so no case classes needed) and transforms to Parquet. http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/ HTH, Mahesh

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread Kan Zhang
Can you use saveAsObjectFile? On Thu, Jun 19, 2014 at 5:54 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I want to store JavaRDD as a sequence file instead of textfile. But i don't see any Java API for that. Is there a way for this? Please let me know. Thanks!

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
No. My understanding from reading the code is that RDD.saveAsObjectFile uses Java serialization and RDD.saveAsSequenceFile uses Writable, which is tied to the Writable serialization framework in HDFS.

Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread Anita Tailor
Thanks Mahesh, I came across this example, which looks like it might give us some direction. https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example Thanks Anita On 20 June 2014 09:03, maheshtwc mahesh.padmanab...@twc-contractor.com wrote:

Re: Serialization problem in Spark

2014-06-19 Thread Daedalus
I'm not sure if this is a Hadoop-centric issue or not. I had similar issues with non-serializable external library classes. I used a Kryo config (as illustrated here https://spark.apache.org/docs/latest/tuning.html#data-serialization ) and registered the one troublesome class. It seemed to work

Repeated Broadcasts

2014-06-19 Thread Daedalus
I'm trying to use Spark (Java) for an optimization algorithm that needs repeated server-node exchanges of information (the ADMM algorithm, for whoever is familiar). In each iteration, I need to update a set of values on the nodes, and collect them on the server, which will update its own set of
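The iteration pattern being described — broadcast the current global values, compute local updates on each node, collect them back, update the global state, repeat — can be sketched as a single-process ADMM consensus loop in plain Python. This is a toy (each "node" i minimizes (x - a[i])^2, so the consensus value converges to the mean of a), not Koert's or anyone's actual Spark code; in Spark the local step would run inside a map over partitions against a re-broadcast z.

```python
def admm_consensus(a, rho=1.0, iters=100):
    """Toy ADMM consensus averaging. Each iteration mimics one
    broadcast (z goes out to the nodes) / collect (x_i + u_i comes
    back to the server) round of the pattern in the thread."""
    n = len(a)
    x = [0.0] * n   # local primal variables, one per 'node'
    u = [0.0] * n   # scaled dual variables, one per 'node'
    z = 0.0         # global consensus value, re-broadcast each round
    for _ in range(iters):
        # local step on each node, using the broadcast z
        x = [(a[i] + rho * (z - u[i])) / (1.0 + rho) for i in range(n)]
        # server collects the local results and re-averages
        z = sum(x[i] + u[i] for i in range(n)) / n
        # dual update stays on each node
        u = [u[i] + x[i] - z for i in range(n)]
    return z

print(round(admm_consensus([1.0, 2.0, 3.0, 6.0]), 6))  # 3.0 (the mean)
```

The key structural point for Spark is that z is small and changes every iteration, which is exactly the repeated-broadcast problem the subject line asks about.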

Re: How do you run your spark app?

2014-06-19 Thread Sonal Goyal
We use maven for building our code and then invoke spark-submit through the exec plugin, passing in our parameters. Works well for us. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jun 20, 2014 at 3:26 AM, Michael Cutler