Re: Getting started : Spark on YARN issue

2014-06-20 Thread Praveen Seluka
Hi Andrew, Thanks for your suggestion. I updated the hdfs-site on the server side and also on the client side to use the hostname instead of the IP, as mentioned here: http://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/ . Now, I could see that the client is

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread abhiguruvayya
Any inputs on this will be helpful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969p7980.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Shixiong Zhu
You can use JavaPairRDD.saveAsHadoopFile/saveAsNewAPIHadoopFile. Best Regards, Shixiong Zhu 2014-06-20 14:22 GMT+08:00 abhiguruvayya sharath.abhis...@gmail.com: Any inputs on this will be helpful. -- View this message in context:

problem about cluster mode of spark 1.0.0

2014-06-20 Thread randylu
my program runs in standalone mode; the command line is like: /opt/spark-1.0.0/bin/spark-submit \ --verbose \ --class $class_name \ --master spark://master:7077 \ --driver-memory 15G \ --driver-cores 2 \ --deploy-mode cluster \

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread abhiguruvayya
Does JavaPairRDD.saveAsHadoopFile store data as a sequenceFile? Then what is the significance of RDD.saveAsSequenceFile? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969p7983.html Sent from

Re: 1.0.1 release plan

2014-06-20 Thread Patrick Wendell
Hey There, I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a vote in the next few days. - Patrick On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim m...@palantir.com wrote: Hi

Re: broadcast in spark streaming

2014-06-20 Thread Hahn Jiang
I get it. thank you On Fri, Jun 20, 2014 at 4:43 PM, Sourav Chandra sourav.chan...@livestream.com wrote: From the StreamingContext object, you can get reference of SparkContext using which you can create broadcast variables On Fri, Jun 20, 2014 at 2:09 PM, Hahn Jiang
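A minimal sketch of the suggestion above: get the SparkContext from the StreamingContext and create the broadcast variable there. The socket source, the map contents and the batch interval are assumptions made for illustration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("broadcast-in-streaming")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The underlying SparkContext is reachable from the StreamingContext,
    // so a broadcast variable can be created from it and used inside DStream ops.
    val lookup = ssc.sparkContext.broadcast(Map("spark" -> 1, "storm" -> 2))

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
    lines.map(word => lookup.value.getOrElse(word, 0)).print()

    ssc.start()
    ssc.awaitTermination()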

How could I set the number of executor?

2014-06-20 Thread Earthson
spark-submit has an argument --num-executors to set the number of executors, but how can I set it from anywhere else? We're using Shark, and want to change the number of executors. The number of executors seems to be the same as the number of workers by default? Shall we configure the executor number manually (Is

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
On 20 June 2014 at 01:46, Shivani Rao raoshiv...@gmail.com wrote: Hello Andrew, I wish I could share the code, but for proprietary reasons I can't. But I can give some idea of what I am trying to do. The job reads a file and processes each line of that file. I am

Re: How could I set the number of executor?

2014-06-20 Thread Earthson
--num-executors seems to be available with YARN only. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-set-the-number-of-executor-tp7990p7992.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: MLLib inside Storm : silly or not ?

2014-06-20 Thread Eustache DIEMERT
Yes, learning on a dedicated Spark cluster and predicting inside a Storm bolt is quite OK :) Thanks all for your answers. I'll post back if/when we try out this solution. E/ 2014-06-19 20:45 GMT+02:00 Shuo Xiang shuoxiang...@gmail.com: If I'm understanding correctly, you want to use

Anything like grid search available for mlbase?

2014-06-20 Thread Charles Earl
Looking for something like scikit's grid search module. C

parallel Reduce within a key

2014-06-20 Thread ansriniv
Hi, I am on Spark 0.9.0. I have a 2-node cluster (2 worker nodes) with 16 cores on each node (so, 32 cores in the cluster). I have an input RDD with 64 partitions. I am running sc.mapPartitions(...).reduce(...). I can see that I get full parallelism on the mapper (all my 32 cores are busy
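For reference, a minimal sketch (with made-up data, assuming the spark-shell's sc) of the mapPartitions(...).reduce(...) pattern described above: the 64 partitions are processed in parallel, and the per-partition results are then combined by reduce.

    // Made-up data: 64 partitions, one partial result per partition.
    val rdd = sc.parallelize(1 to 1000000, 64)

    val partialSums = rdd.mapPartitions { iter =>
      // each partition yields exactly one partial sum
      Iterator(iter.foldLeft(0L)(_ + _))
    }

    // The partial results are then combined into the final value.
    val total = partialSums.reduce(_ + _)
    println(total)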

java.net.SocketTimeoutException: Read timed out and java.io.IOException: Filesystem closed on Spark 1.0

2014-06-20 Thread Arun Ahuja
Hi all, I'm running a job that seems to continually fail with the following exception: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at

Re: Anything like grid search available for mlbase?

2014-06-20 Thread Xiangrui Meng
This is a planned feature for v1.1. I'm going to work on it after v1.0.1 release. -Xiangrui On Jun 20, 2014, at 6:46 AM, Charles Earl charles.ce...@gmail.com wrote: Looking for something like scikit's grid search module. C

Performance problems on SQL JOIN

2014-06-20 Thread mathias
Hi there, We're trying out Spark and are experiencing some performance issues using Spark SQL. Can anyone tell us if our results are normal? We are using the Amazon EC2 scripts to create a cluster with 3 workers/executors (m1.large). Tried both spark 1.0.0 as well as the git master; the

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Shivani Rao
Hello Abhi, I did try that and it did not work. And Eugen, yes, I am assembling the argonaut libraries in the fat jar. So how did you overcome this problem? Shivani On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: On 20 June 2014 at 01:46, Shivani Rao

Re: trying to understand yarn-client mode

2014-06-20 Thread Koert Kuipers
thanks! i will try that. i guess what i am most confused about is why the executors are trying to retrieve the jars directly using the info i provided to add jars to my spark context. i mean, that's bound to fail, no? i could be on a different machine (so my file:// isn't going to work for them), or

Better way to use a large data set?

2014-06-20 Thread Muttineni, Vinay
Hi All, I have an 8 million row, 500 column data set, which is derived by reading a text file and doing a filter, flatMap operation to weed out some anomalies. Now, I have a process which has to run through all 500 columns, do a couple of map, reduce, forEach operations on the data set and return some

Re: Performance problems on SQL JOIN

2014-06-20 Thread Xiangrui Meng
Your data source is S3 and data is used twice. m1.large does not have very good network performance. Please try file.count() and see how fast it goes. -Xiangrui On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk wrote: Hi there, We're trying out Spark and are

Re: Performance problems on SQL JOIN

2014-06-20 Thread Evan R. Sparks
Also - you could consider caching your data after the first split (before the first filter), this will prevent you from retrieving the data from s3 twice. On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote: Your data source is S3 and data is used twice. m1.large does not
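A minimal sketch of that suggestion, with a hypothetical S3 path and layout: cache the parsed rows once so both sides of the join reuse them instead of re-reading from S3.

    // Hypothetical path and layout; the point is only to cache before the filters.
    val file = sc.textFile("s3n://my-bucket/rooms.csv")

    val rows = file.map(_.split(','))
    rows.cache()                               // materialized once, reused twice

    val rooms2 = rows.filter(r => r.length > 0 && r(0) == "2")
    val rooms3 = rows.filter(r => r.length > 0 && r(0) == "3")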

broadcast not working in yarn-cluster mode

2014-06-20 Thread Christophe Préaud
Hi, Since I migrated to spark 1.0.0, a couple of applications that used to work in 0.9.1 now fail when broadcasting a variable. Those applications are run on a YARN cluster in yarn-cluster mode (and used to run in yarn-standalone mode in 0.9.1) Here is an extract of the error log: Exception

Re: How do you run your spark app?

2014-06-20 Thread Shivani Rao
Hello Michael, I have a quick question for you. Can you clarify the statement "build fat JARs and build dist-style TAR.GZ packages with launch scripts, JARs and everything needed to run a Job"? Can you give an example? I am using sbt assembly as well to create a fat jar, and supplying the

spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead of hdfs. what could cause this? in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not local

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Marcelo Vanzin
Hi Koert, Could you provide more details? Job arguments, log messages, errors, etc. On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote: i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead of hdfs. what could cause this?

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Kan Zhang
Yes, it can if you set the output format to SequenceFileOutputFormat. The difference is saveAsSequenceFile does the conversion to Writable for you if needed and then calls saveAsHadoopFile. On Fri, Jun 20, 2014 at 12:43 AM, abhiguruvayya sharath.abhis...@gmail.com wrote: Does
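A minimal Scala sketch of what that amounts to; the path and the key/value types are assumptions, and the Java API call on a JavaPairRDD takes the same class arguments.

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.SequenceFileOutputFormat
    import org.apache.spark.SparkContext._

    // Hypothetical pair data; keys and values are converted to Writables explicitly,
    // which is the part saveAsSequenceFile would otherwise do for you.
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
    val writables = pairs.map { case (k, v) => (new IntWritable(k), new Text(v)) }

    writables.saveAsHadoopFile(
      "hdfs:///tmp/seq-output",                                    // hypothetical path
      classOf[IntWritable],
      classOf[Text],
      classOf[SequenceFileOutputFormat[IntWritable, Text]])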

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
In my case it was due to a case class I was defining in the spark-shell that was not available on the workers. So packaging it in a jar and adding it with ADD_JARS solved the problem. Note that I don't exactly remember if it was an out of heap space exception or permgen space. Make sure your

Re: trying to understand yarn-client mode

2014-06-20 Thread Marcelo Vanzin
On Fri, Jun 20, 2014 at 8:22 AM, Koert Kuipers ko...@tresata.com wrote: thanks! i will try that. i guess what i am most confused about is why the executors are trying to retrieve the jars directly using the info i provided to add jars to my spark context. i mean, thats bound to fail no? i

Re: 1.0.1 release plan

2014-06-20 Thread Andrew Ash
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the below issues without running a patched version of Spark: https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version conflicts for client applications https://issues.apache.org/jira/browse/SPARK-2043 --

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
yeah sure see below. i strongly suspect its something i misconfigured causing yarn to try to use local filesystem mistakenly. * [koert@cdh5-yarn ~]$ /usr/local/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3

Re: Performance problems on SQL JOIN

2014-06-20 Thread mathias
Thanks for your suggestions. file.count() takes 7s, so that doesn't seem to be the problem. Moreover, a union with the same code/CSV takes about 15s (SELECT * FROM rooms2 UNION SELECT * FROM rooms3). The web status page shows that both stages 'count at joins.scala:216' and 'reduce at

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread bc Wong
Koert, is there any chance that your fs.defaultFS isn't setup right? On Fri, Jun 20, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote: yeah sure see below. i strongly suspect its something i misconfigured causing yarn to try to use local filesystem mistakenly. *
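One quick way (a sketch, run e.g. from the spark-shell on the submitting machine) to see which default filesystem Spark's Hadoop configuration actually resolves; it should point at HDFS rather than the local filesystem.

    // Prints the default filesystem Spark's Hadoop configuration has picked up.
    // Older configs use fs.default.name, newer ones fs.defaultFS, so check both.
    val hadoopConf = sc.hadoopConfiguration
    println("fs.defaultFS    = " + hadoopConf.get("fs.defaultFS"))
    println("fs.default.name = " + hadoopConf.get("fs.default.name"))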

Can not checkpoint Graph object's vertices but could checkpoint edges

2014-06-20 Thread dash
I'm trying to work around the StackOverflowError when an object has a long dependency chain; someone said I should use checkpoint to cut off dependencies. I wrote some sample code to test it, but I can only checkpoint edges, not vertices. I think I do materialize vertices and edges after calling
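A minimal sketch (with a made-up graph and a hypothetical checkpoint directory) of the pattern being attempted: checkpoint both RDDs of the graph, then materialize them with an action.

    import org.apache.spark.graphx.{Edge, Graph}

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical directory

    // Made-up graph purely for illustration.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph.fromEdges(edges, defaultValue = "user")

    // Mark both RDDs for checkpointing, then force materialization with an action.
    graph.vertices.checkpoint()
    graph.edges.checkpoint()
    graph.vertices.count()
    graph.edges.count()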

Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
I've tried to parallelize the separate regressions using allResponses.toParArray.map( x= do logistic regression against labels in x) But I start to see messages like 14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task 363.0:4) 14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss

Re: How do you run your spark app?

2014-06-20 Thread Shrikar archak
Hi Shivani, I use sbt assembly to create a fat jar. https://github.com/sbt/sbt-assembly An example of the sbt file is below. import AssemblyKeys._ // put this at the top of the file assemblySettings mainClass in assembly := Some(FifaSparkStreaming) name := FifaSparkStreaming version := 1.0
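For reference, a hedged reconstruction of what such a build.sbt could look like with the older sbt-assembly 0.11.x DSL quoted above; the Scala/Spark versions and the dependency list are assumptions.

    // build.sbt (sketch; assumes sbt-assembly 0.11.x is declared in project/plugins.sbt)
    import AssemblyKeys._          // put this at the top of the file

    assemblySettings

    name := "FifaSparkStreaming"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
    )

    mainClass in assembly := Some("FifaSparkStreaming")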

Re: parallel Reduce within a key

2014-06-20 Thread Michael Malak
How about a treeReduceByKey? :-) On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote: Currently, the reduce operation combines the results from the mappers sequentially, so it's O(n). Xiangrui is working on treeReduce, which is O(log(n)). Based on the benchmark, it dramatically

Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Shrikar archak
Hi All, I was curious to know which of the two approaches is better for doing analytics using Spark Streaming. Let's say we want to add some metadata to the stream being processed, like sentiment, tags etc., and then perform some analytics using this added metadata. 1) Is it ok to make a
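A minimal sketch of the first approach, tagging each record with metadata as it flows through the stream; the socket source, the Tagged class and the sentiment/tag functions are all hypothetical stand-ins.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Hypothetical record wrapper and metadata extractors.
    case class Tagged(text: String, sentiment: Double, tags: Seq[String])
    def sentimentOf(text: String): Double = if (text.contains("great")) 1.0 else 0.0
    def tagsOf(text: String): Seq[String] = text.split("\\s+").filter(_.startsWith("#")).toSeq

    val conf = new SparkConf().setAppName("stream-metadata")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)        // hypothetical source
    val enriched = lines.map(line => Tagged(line, sentimentOf(line), tagsOf(line)))

    // Downstream analytics can now aggregate on the added metadata.
    enriched.map(t => (t.sentiment, 1L)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()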

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-20 Thread Sébastien Rainville
Hi, this is just a follow-up regarding this issue. Turns out that it's caused by a bug in Spark. I created a case for it: https://issues.apache.org/jira/browse/SPARK-2204 and submitted a patch. Any chance this could be included in the 1.0.1 release? Thanks, - Sebastien On Tue, Jun 17, 2014

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
ok solved it. as it happened in spark/conf i also had a file called core.site.xml (with some tachyon related stuff in it) so that's why it ignored /etc/hadoop/conf/core-site.xml On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers ko...@tresata.com wrote: i put some logging statements in

Running Spark alongside Hadoop

2014-06-20 Thread Sameer Tilak
Dear Spark users, I have a small 4 node Hadoop cluster. Each node is a VM -- 4 virtual cores, 8GB memory and 500GB disk. I am currently running Hadoop on it. I would like to run Spark (in standalone mode) alongside Hadoop on the same nodes. Given the configuration of my nodes, will that work?

Re: Running Spark alongside Hadoop

2014-06-20 Thread Mayur Rustagi
The ideal way to do that is to use a cluster manager like YARN or Mesos. You can control how many resources to give to which node, etc. You should be able to run both together in standalone mode; however, you may experience varying latency/performance in the cluster as both MR and Spark demand resources

Re: Spark and RDF

2014-06-20 Thread Mayur Rustagi
Are you looking to create Shark operators for RDF? Since the Shark backend is shifting to SparkSQL it would be slightly hard, but a much better effort would be to shift Gremlin to Spark (though a much beefier one :) ) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Spark and RDF

2014-06-20 Thread andy petrella
Maybe some SPARQL features in Shark, then? aℕdy ℙetrella about.me/noootsab http://about.me/noootsab On Fri, Jun 20, 2014 at 9:45 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: You are looking to create Shark operators for RDF? Since Shark backend is

Re: Spark and RDF

2014-06-20 Thread Mayur Rustagi
or a separate RDD for SPARQL operations a la SchemaRDD... operators for SPARQL can be defined there... not a bad idea :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 20, 2014 at 3:56 PM, andy petrella

Re: Running Spark alongside Hadoop

2014-06-20 Thread Koert Kuipers
for development/testing i think its fine to run them side by side as you suggested, using spark standalone. just be realistic about what size data you can load with limited RAM. On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: The ideal way to do that is to use a

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Tathagata Das
If the metadata is directly related to each individual record, then it can be done either way. Since I am not sure how easy or hard it will be for you to add tags before putting the data into Spark Streaming, it's hard to recommend one method over the other. However, if the metadata is related to

Set the number/memory of workers under mesos

2014-06-20 Thread Shuo Xiang
Hi, just wondering if anybody knows how to set the number of workers (and the amount of memory) in Mesos while launching spark-shell? I was trying to edit conf/spark-env.sh and it looks like the environment variables are for YARN or standalone. Thanks!

Re: Set the number/memory of workers under mesos

2014-06-20 Thread Mayur Rustagi
You should be able to configure it in the Spark context in the Spark shell: spark.cores.max and memory. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 20, 2014 at 4:30 PM, Shuo Xiang shuoxiang...@gmail.com wrote:
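A minimal sketch of that suggestion, with a hypothetical Mesos master URL and made-up sizes: set spark.cores.max and the executor memory on the configuration used to build the context.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical master URL and sizes; the point is the two properties below.
    val conf = new SparkConf()
      .setMaster("mesos://mesos-master:5050")
      .setAppName("mesos-sizing")
      .set("spark.cores.max", "16")          // cap on total cores across the cluster
      .set("spark.executor.memory", "8g")    // memory per executor

    val sc = new SparkContext(conf)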

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
In short, ADD_JARS will add the jar to your driver classpath and also send it to the workers (similar to what you are doing when you do sc.addJars). ex: MASTER=master/url ADD_JARS=/path/to/myJob.jar ./bin/spark-shell You also have SPARK_CLASSPATH var but it does not distribute the code, it is

Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
It looks like I was running into https://issues.apache.org/jira/browse/SPARK-2204 The issues went away when I changed to spark.mesos.coarse. Kyle On Fri, Jun 20, 2014 at 10:36 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've tried to parallelize the separate regressions using

Re: options set in spark-env.sh is not reflecting on actual execution

2014-06-20 Thread Andrew Or
Hi Meethu, Are you using Spark 1.0? If so, you should use spark-submit ( http://spark.apache.org/docs/latest/submitting-applications.html), which has --executor-memory. If you don't want to specify this every time you submit an application, you can also specify spark.executor.memory in

kibana like frontend for spark

2014-06-20 Thread Mohit Jaggi
Folks, I want to analyse logs and I want to use spark for that. However, elasticsearch has a fancy frontend in Kibana. Kibana's docs indicate that it works with elasticsearch only. Is there a similar frontend that can work with spark? Mohit. P.S.: On MapR's spark FAQ I read a statement like

Fwd: Using Spark

2014-06-20 Thread Ricky Thomas
Hi, We would like to add ourselves to the user list if possible, please. Company: truedash url: truedash.io Automatic pulling of all your data into Spark for enterprise visualisation, predictive analytics and data exploration at a low cost. Currently in development with a few clients. Thanks

Re: How do you run your spark app?

2014-06-20 Thread Shivani Rao
Hello Shrikar, Thanks for your email. I have been using the same workflow as you did. But my question was related to the creation of the SparkContext. My question was: if I am specifying jars in the java -cp jar-paths, and adding them to my build.sbt, do I need to additionally add them in my code

Re: Worker dies while submitting a job

2014-06-20 Thread Shivani Rao
That error typically means that there is a communication error (wrong ports) between master and worker. Also check if the worker has write permissions to create the work directory. We were getting this error due to one of the above two reasons. On Tue, Jun 17, 2014 at 10:04 AM, Luis Ángel Vicente

Re: How do you run your spark app?

2014-06-20 Thread Andrei
Hi Shivani, Adding JARs to the classpath (e.g. via the -cp option) is needed to run your _local_ Java application, whatever it is. To deliver them to _other machines_ for execution you need to add them to the SparkContext. And you can do it in 2 different ways: 1. Add them right from your code (your
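A minimal sketch of option 1, with hypothetical jar paths: declare the jars on the SparkConf (or add them after the context exists) so Spark ships them to the executors.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical jar paths; these are shipped to the executors by Spark.
    val conf = new SparkConf()
      .setAppName("my-app")
      .setJars(Seq("/path/to/my-fat-assembly.jar"))

    val sc = new SparkContext(conf)

    // A jar can also be added after the context has been created.
    sc.addJar("/path/to/extra-dependency.jar")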

sc.textFile can't recognize '\004'

2014-06-20 Thread anny9699
Hi, I need to parse a file which is separated by a series of separators. I used SparkContext.textFile and I met two problems: 1) One of the separators is '\004', which can be recognized by Python, R, or Hive; however, Spark doesn't seem to recognize this one and returns a symbol looking like '?'.
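A minimal sketch for the separator part of the question (the input path is hypothetical): textFile still returns the raw lines, and the '\004' byte can then be split on explicitly via its unicode escape.

    // Hypothetical input path; '\u0004' is the same byte as the '\004' separator.
    val raw = sc.textFile("hdfs:///data/ctrl-d-separated.txt")

    val fields = raw.map(_.split('\u0004'))
    fields.take(5).foreach(arr => println(arr.mkString(" | ")))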

Re: Running Spark alongside Hadoop

2014-06-20 Thread Ognen Duzlevski
I only ran HDFS on the same nodes as Spark and that worked out great, performance- and robustness-wise. However, I did not run Hadoop itself to do any computations/jobs on the same nodes. My expectation is that if you actually ran both at the same time with your configuration, the performance