Re: How to debug: Runs locally but not on cluster

2014-08-14 Thread jerryye
I've isolated this to a memory issue, but I don't know which parameter I need to tweak. If I sample my samples RDD with 35% of the data, everything runs to completion; with a larger fraction, it fails. In standalone mode, I can run on the full RDD without any problems. // works val samples =
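
A hedged sketch of the kind of sampling and executor-memory override being discussed; the path, fraction, and memory values below are placeholders, not figures from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; the original post does not give the real path or sizes.
val conf = new SparkConf()
  .setAppName("sampling-debug")
  .set("spark.executor.memory", "8g")          // heap per executor
  .set("spark.storage.memoryFraction", "0.5")  // share reserved for cached RDDs (pre-1.6 setting)
val sc = new SparkContext(conf)

val samples = sc.textFile("hdfs:///data/samples")
// works: run on a 35% sample without replacement (args: withReplacement, fraction, seed)
val subset = samples.sample(false, 0.35, 42)
println(subset.count())
```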

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Tassilo Klein
Thanks. This was already helping a bit. But the examples don't use custom InputFormats. Rather, org.apache fully qualified InputFormat. If I want to use my own custom InputFormat in form of .class (or jar) how can I use it? I tried providing it to pyspark with --jars myCustomInputFormat.jar and

Re: training recsys model

2014-08-14 Thread Xiangrui Meng
Try many combinations of parameters on a small dataset, find the best, and then try to map them to a big dataset. You can also reduce the search region iteratively based on the best combination in the current iteration. -Xiangrui On Wed, Aug 13, 2014 at 1:13 AM, Hoai-Thu Vuong thuv...@gmail.com
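
A sketch of the grid-search idea, assuming `sc` is a live SparkContext; the tiny stand-in data, the parameter grid, and the RMSE helper are illustrative only, not recommendations.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Tiny stand-in data so the sketch runs; replace with real sampled training/validation sets.
val training = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 3.0), Rating(2, 1, 4.0)))
val validation = sc.parallelize(Seq(Rating(2, 2, 2.0)))

// Illustrative RMSE helper over a held-out set.
def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val joined = data.map(r => ((r.user, r.product), r.rating)).join(predictions)
  math.sqrt(joined.map { case (_, (actual, predicted)) =>
    (actual - predicted) * (actual - predicted)
  }.mean())
}

// Small, arbitrary grid; shrink or refine it around the best point iteratively.
val grid = for {
  rank       <- Seq(10, 20, 50)
  lambda     <- Seq(0.01, 0.1, 1.0)
  iterations <- Seq(10, 20)
} yield (rank, lambda, iterations)

val best = grid.map { case (rank, lambda, iterations) =>
  val model = ALS.train(training, rank, iterations, lambda)
  ((rank, lambda, iterations), rmse(model, validation))
}.minBy(_._2)
println(s"best (rank, lambda, iterations) = ${best._1}, RMSE = ${best._2}")
```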

Re: Spark Akka/actor failures.

2014-08-14 Thread Xiangrui Meng
Could you try to map it to row-major format first? Your approach may generate multiple copies of the data. The code should look like this: val rows = rdd.map { case (j, values) => values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) } }.groupByKey().map { case (i, entries) =>
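
A hedged reconstruction of a complete version of that transformation, assuming the input is column-major (one record per column, holding that column's value for every row) and the goal is one array per row; note it uses flatMap so that groupByKey sees individual (rowIndex, (colIndex, value)) pairs.

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions in 1.x

// Assumes `sc` is a live SparkContext; tiny sample data stands in for the real column-major RDD.
val colMajor = sc.parallelize(Seq(
  (0, Array(1.0, 4.0)),   // column 0: values for row 0 and row 1
  (1, Array(2.0, 5.0)),   // column 1
  (2, Array(3.0, 6.0))    // column 2
))

val rowMajor = colMajor.flatMap { case (j, values) =>
  values.zipWithIndex.map { case (v, i) => (i, (j, v)) }   // (rowIndex, (colIndex, value))
}.groupByKey().map { case (i, entries) =>
  (i, entries.toSeq.sortBy(_._1).map(_._2).toArray)        // order by column, keep values
}
rowMajor.collect().foreach { case (i, row) => println(s"row $i: ${row.mkString(",")}") }
```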

Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-08-14 Thread jerryye
bump. same problem here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-aborted-due-to-stage-failure-TID-x-failed-for-unknown-reasons-tp10187p12095.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

how to use the method saveAsTextFile of an RDD like JavaRDD&lt;MyOwnClass[]&gt;

2014-08-14 Thread Gefei Li
Hello, I wrote a class named BooleanPair: public static class BooleanPairet implements Serializable{ public Boolean elementBool1; public Boolean elementBool2; BooleanPair(Boolean bool1, Boolean bool2){elementBool1 = bool1; elementBool2 = bool2;} public String

Re: how to use the method saveAsTextFile of an RDD like JavaRDD&lt;MyOwnClass[]&gt;

2014-08-14 Thread Tathagata Das
FlatMap the JavaRDD&lt;BooleanPair[]&gt; to JavaRDD&lt;BooleanPair&gt;. Then it should work. TD On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li gefeili.2...@gmail.com wrote: Hello, I wrote a class named BooleanPair: public static class BooleanPairet implements Serializable{ public Boolean
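
A hedged Scala sketch of the same idea (the thread itself uses the Java API): flatten the RDD of arrays into an RDD of elements, give the element class a readable toString, and saveAsTextFile writes one element per line. The class and field names simply mirror the example above.

```scala
// Assumes `sc` is a live SparkContext; the sample data stands in for the real RDD.
case class BooleanPair(elementBool1: Boolean, elementBool2: Boolean) {
  override def toString = s"$elementBool1,$elementBool2"   // one line per element in the output
}

val pairArrays = sc.parallelize(Seq(
  Array(BooleanPair(true, false), BooleanPair(false, false)),
  Array(BooleanPair(true, true))
))

// RDD[Array[BooleanPair]] -> RDD[BooleanPair], then write as text.
pairArrays.flatMap(arr => arr).saveAsTextFile("hdfs:///tmp/boolean-pairs")
```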

read performance issue

2014-08-14 Thread Gurvinder Singh
Hi, I am running Spark built from git directly. I recently compiled the newer version (Aug 13) and it has a 2-3x performance drop in reads from HDFS compared to the git version of Aug 1. So I am wondering which commit could have caused such an issue in read performance. The performance is almost

Re: how to use the method saveAsTextFile of an RDD like JavaRDD&lt;MyOwnClass[]&gt;

2014-08-14 Thread Gefei Li
Thank you! It works so well for me! Regards, Gefei On Thu, Aug 14, 2014 at 4:25 PM, Tathagata Das tathagata.das1...@gmail.com wrote: FlatMap the JavaRDD&lt;BooleanPair[]&gt; to JavaRDD&lt;BooleanPair&gt;. Then it should work. TD On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li gefeili.2...@gmail.com wrote:

Re: How to direct insert values into SparkSQL tables?

2014-08-14 Thread chutium
Oh, right, I meant within SQLContext alone: a SchemaRDD from a text file with a case class -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p12100.html Sent from the Apache Spark User List mailing list
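
A hedged sketch of that pattern against the Spark 1.0-era SQL API (SchemaRDD); the file layout, case class, and table name are invented for illustration, and `sc` is assumed to be a live SparkContext.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD         // implicit RDD[case class] -> SchemaRDD

// people.txt lines like "Alice,30"
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")          // 1.0.x name; later versions call it registerTempTable
sqlContext.sql("SELECT name FROM people WHERE age > 20").collect().foreach(println)
```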

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Hoai-Thu Vuong
A man in this community gave me a video: https://www.youtube.com/watch?v=sPhyePwo7FA. I had the same question in this community and other guys helped me solve the problem. I'm trying to load MatrixFactorizationModel from an object file, but the compiler said that I cannot create the object because the
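
A hedged workaround sketch for the constructor problem mentioned here: persist the model's two factor RDDs instead of the model object itself, then score by taking the dot product of the reloaded factors. The paths are placeholders and `sc` is assumed to be a live SparkContext.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

def saveFactors(model: MatrixFactorizationModel, dir: String): Unit = {
  model.userFeatures.saveAsObjectFile(dir + "/userFeatures")
  model.productFeatures.saveAsObjectFile(dir + "/productFeatures")
}

def predictFromSaved(dir: String, user: Int, product: Int): Double = {
  val userFeatures: RDD[(Int, Array[Double])] = sc.objectFile(dir + "/userFeatures")
  val productFeatures: RDD[(Int, Array[Double])] = sc.objectFile(dir + "/productFeatures")
  val u = userFeatures.lookup(user).head
  val p = productFeatures.lookup(product).head
  u.zip(p).map { case (a, b) => a * b }.sum   // predicted rating = dot product of the factors
}
```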

Re: how to use the method saveAsTextFile of an RDD like JavaRDD&lt;MyOwnClass[]&gt;

2014-08-14 Thread Gefei Li
It is interesting to save an RDD on disk or HDFS or somewhere else as a set of objects, but I think it's more useful to save it as a text file for debugging or just as an output file. If we want to reuse an RDD, a text file also works, but perhaps a set of object files will bring a decrease on

Re: Script to deploy spark to Google compute engine

2014-08-14 Thread Michael Hausenblas
Did you check out http://www.spark-stack.org/spark-cluster-on-google-compute/ already? Cheers, Michael -- Michael Hausenblas Ireland, Europe http://mhausenblas.info/ On 14 Aug 2014, at 05:17, Soumya Simanta soumya.sima...@gmail.com wrote: Before I start doing something on

Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Darin McBeath
I started up a cluster on EC2 (using the provided scripts) and specified a different instance type for the master and the worker nodes. The cluster started fine, but when I looked at the cluster (via port 8080), it showed that the amount of memory available to the worker nodes did not

Re: Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Akhil Das
Hi Darin, This is the piece of code https://github.com/mesos/spark-ec2/blob/v3/deploy_templates.py doing the actual work (setting the memory). As you can see, it leaves 15GB of RAM for the OS on a 100GB machine, 2GB of RAM on a 10-20GB machine, etc. You can always set

Re: Python + Spark unable to connect to S3 bucket .... Invalid hostname in URI

2014-08-14 Thread Miroslaw
I have tried that already but still get the same error. To be honest, I feel as though I am missing something obvious in my configuration; I just can't find what it may be. Miroslaw Horbal On Wed, Aug 13, 2014 at 10:38 PM, jerryye [via Apache Spark User List]

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Christopher Nguyen
Hi Hoai-Thu, the issue of the private default constructor is unlikely to be the cause here, since Lance was already able to load/deserialize the model object. And on that side topic, I wish all serdes libraries would just use constructor.setAccessible(true) by default :-) Most of the time that privacy is

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Matt Narrell
I’d suggest something like Apache YARN, or Apache Mesos with Marathon or something similar to allow for management, in particular restart on failure. mn On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu

Re: Down-scaling Spark on EC2 cluster

2014-08-14 Thread Shubhabrata
What about down-scaling when I use Mesos? Does that really degrade the performance? Otherwise we would probably go for Spark on Mesos on EC2 :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Down-scaling-Spark-on-EC2-cluster-tp10494p12109.html Sent

Using Spark Streaming to listen to HDFS directory and handle different files by file name

2014-08-14 Thread ZhangYi
As we know, in Spark, SparkContext provides the wholeTextFiles() method to read all files in a specific directory and generate an RDD of (fileName, content) pairs: scala> val lines = sc.wholeTextFiles("/Users/workspace/scala101/data") 14/08/14 22:43:02 INFO MemoryStore: ensureFreeSpace(35896) called with
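
A hedged batch-style sketch of the file-name branching part (the streaming side is a separate question): wholeTextFiles keeps the name attached to each file, so files can be routed by a name pattern. The directory and suffixes are illustrative; `sc` is assumed to be a live SparkContext.

```scala
val files = sc.wholeTextFiles("hdfs:///incoming/data")   // RDD[(fileName, fileContent)]

// Route files by name pattern; the suffixes here are made up.
val typeA = files.filter { case (name, _) => name.endsWith(".typeA.csv") }
val typeB = files.filter { case (name, _) => name.endsWith(".typeB.csv") }

val typeARecords = typeA.flatMap { case (_, content) => content.split("\n") }
println(s"typeA files: ${typeA.count()}, typeB files: ${typeB.count()}")
```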

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think I can reproduce this error. The following code does not work and reports that Foo cannot be serialized (log in gist https://gist.github.com/zsxwing/4f9f17201d4378fe3e16): class Foo { def foo() = Array(1.0) } val t = new Foo val m = t.foo val r1 = sc.parallelize(List(1, 2, 3)) val r2 = r1.map(_

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Silvio Fiorito
You also need to ensure you're using checkpointing and support recreating the context on driver failure as described in the docs here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node From: Matt Narrell

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread lancezhange
The following code works, too: class Foo1 extends Serializable { def foo() = Array(1.0) } val t1 = new Foo1 val m1 = t1.foo val r11 = sc.parallelize(List(1, 2, 3)) val r22 = r11.map(_ + m1(0)) r22.toArray On Thu, Aug 14, 2014 at 10:55 PM, Shixiong Zhu [via Apache Spark User List]

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think in the following case: class Foo { def foo() = Array(1.0) } val t = new Foo val m = t.foo val r1 = sc.parallelize(List(1, 2, 3)) val r2 = r1.map(_ + m(0)) r2.toArray Spark should not serialize t, but it looks like it will. Best Regards, Shixiong Zhu 2014-08-14 23:22 GMT+08:00 lancezhange
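
One common way to sidestep this, sketched under the assumption that the job is built as an application rather than typed at the REPL top level: keep the extracted value genuinely local so the closure captures only the Array and never the enclosing Foo (making the class Serializable, as in the previous message, is the other route).

```scala
import org.apache.spark.SparkContext

class Foo { def foo() = Array(1.0) }

// Inside a method, `m` is a plain local value; the closure shipped to executors
// captures only the Array, so Foo never needs to be serialized.
def run(sc: SparkContext): Array[Double] = {
  val t = new Foo
  val m = t.foo
  val r1 = sc.parallelize(List(1, 2, 3))
  r1.map(_ + m(0)).collect()
}
```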

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Kan Zhang
Good timing! I encountered that same issue recently and to address it, I changed the default Class.forName call to Utils.classForName. See my patch at https://github.com/apache/spark/pull/1916. After that change, my bin/pyspark --jars worked. On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein

SPARK_DRIVER_MEMORY

2014-08-14 Thread Brad Miller
Hi All, I have a Spark job for which I need to increase the amount of memory allocated to the driver to collect a large-ish (200M) data structure. Formerly, I accomplished this by setting SPARK_MEM before invoking my job (which effectively set memory on the driver) and then setting

Re: Ways to partition the RDD

2014-08-14 Thread ssb61
You can try something like this: val kvRdd = sc.textFile("rawdata/").map( m => { val pfUser = m.split("\t", 2); (pfUser(0) -> pfUser(1)) })
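
A hedged expansion of that snippet (tab-separated "key&lt;TAB&gt;payload" lines assumed): key by the first field, then explicitly spread the pairs with a HashPartitioner. The count of 8 is just the number mentioned later in the thread, not a recommendation, and `sc` is assumed to be a live SparkContext.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions in 1.x

val kvRdd = sc.textFile("rawdata/").map { line =>
  val parts = line.split("\t", 2)
  (parts(0), parts(1))                   // (key, rest of line)
}
val partitioned = kvRdd.partitionBy(new HashPartitioner(8)).persist()
println(partitioned.partitions.size)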

RE: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Sameer Tilak
Hi Yanbo, I think it was happening because some of the rows did not have all the columns. We are cleaning up the data and will let you know once we confirm this. Date: Thu, 14 Aug 2014 22:50:58 +0800 Subject: Re: java.lang.UnknownError: no bin was found for continuous variable. From:

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
I tried a simple spark-hive select and insert, and it works. But to directly manipulate ORC files through RDDs, Spark has to be upgraded to support Hive 0.13 first, because some of the ORC API is not exposed until Hive 0.13. Thanks. Zhan Zhang On Aug 11, 2014, at 10:23 PM,

MLlib model: viewing and saving

2014-08-14 Thread Sameer Tilak
I have an MLlib model: val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth) I see the model has the following methods: algo, asInstanceOf, isInstanceOf, predict, toString, topNode. model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf =
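
A hedged sketch around the 1.0-era API quoted above, showing how such a model is typically exercised and inspected; `parsedData` is assumed to be an RDD[LabeledPoint] prepared elsewhere, and maxDepth is illustrative.

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Variance
import org.apache.spark.mllib.linalg.Vectors

val maxDepth = 5
val model = DecisionTree.train(parsedData, Algo.Regression, Variance, maxDepth)

// Score a single point (feature values are made up) and walk the tree from the root.
val prediction = model.predict(Vectors.dense(1.0, 0.0, 3.0))
println(model.topNode)            // id, split, isLeaf, predict for the root node
println(model.topNode.leftNode)   // Option[Node]; children can be traversed recursively
```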

SPARK_LOCAL_DIRS

2014-08-14 Thread Brad Miller
Hi All, I'm having some trouble setting the disk spill directory for spark. The following approaches set spark.local.dir (according to the Environment tab of the web UI) but produce the indicated warnings: *In spark-env.sh:* export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill *Associated

Re: SPARK_LOCAL_DIRS

2014-08-14 Thread Debasish Das
Actually I faced it yesterday... I had to put it in spark-env.sh and take it out of spark-defaults.conf on 1.0.1... Note that this setting should be visible on all workers. After that I validated that SPARK_LOCAL_DIRS was indeed getting used for shuffling... On Thu, Aug 14, 2014 at 10:27

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread TJ Klein
Yes, thanks great. This seems to be the issue. At least running with spark-submit works as well. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html Sent from the Apache Spark User List mailing list archive

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, will give that a try. I see the number of partitions requested is 8 (through HashPartitioner(8)). If I have a 40-node cluster, what's the recommended number of partitions? -- View this message in context:

Re: Subscribing to news releases

2014-08-14 Thread Nicholas Chammas
I've created an issue to track this: SPARK-3044: Create RSS feed for Spark News https://issues.apache.org/jira/browse/SPARK-3044 On Fri, May 30, 2014 at 11:07 AM, Nick Chammas nicholas.cham...@gmail.com wrote: Is there a way to subscribe to news releases

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
Yes, you are right, but I tried the old hadoopFile API for OrcInputFormat. In Hive 0.12, OrcStruct does not expose its API, so Spark cannot access it. With Hive 0.13, an RDD can read from an ORC file. Btw, I didn't see ORCNewOutputFormat in Hive 0.13. Direct RDD manipulation (Hive 0.13): val inputRead =

How to transform large local files into Parquet format and write into HDFS?

2014-08-14 Thread Parthus
Hi there, I have several large files (500GB per file) to transform into Parquet format and write to HDFS. The problems I encountered can be described as follows: 1) At first, I tried to load all the records in a file and then used sc.parallelize(data) to generate an RDD and finally used
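
A hedged alternative to collecting everything and calling sc.parallelize: let Spark read the file as an RDD directly, convert it to a SchemaRDD via a case class, and write Parquet straight to HDFS. The record layout and paths are invented, `sc` is assumed, and the input is assumed to already be reachable from every worker (e.g. on HDFS).

```scala
import org.apache.spark.sql.SQLContext

case class Record(id: Long, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val records = sc.textFile("hdfs:///raw/bigfile.txt")   // read lazily, never collected to the driver
  .map(_.split("\t"))
  .map(f => Record(f(0).toLong, f(1)))

records.saveAsParquetFile("hdfs:///warehouse/bigfile.parquet")
```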

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
First, I think you might have a misconception about partitioning. ALL RDDs are partitioned (even if they are a single partition). When reading from HDFS the number of partitions depends on how the data is stored in HDFS. After data is shuffled (generally caused by things like reduceByKey), the

Documentation to start with

2014-08-14 Thread Abhilash K Challa
Hi, Does anyone have specific documentation for integrating Spark with a Hadoop distribution (that does not already have Spark)? Thanks, Abhilash

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
Hi Davies, I tried the second option and launched my EC2 cluster with the master branch on all the slaves by providing the latest commit hash of master as the --spark-version option to the spark-ec2 script. However, I am getting the same errors as before. I am running the job with the original

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
The errors are occurring at the exact same point in the job as well, right at the end of the groupByKey() when 5 tasks are left. On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh ar...@automatic.com wrote: Hi Davies, I tried the second option and launched my ec2 cluster with master on all

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already partitioned, there is no need to worry about repartitioning. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12136.html Sent from the Apache Spark User

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
I agree. We need the support similar to parquet file for end user. That’s the purpose of Spark-2883. Thanks. Zhan Zhang On Aug 14, 2014, at 11:42 AM, Yin Huai huaiyin@gmail.com wrote: I feel that using hadoopFile and saveAsHadoopFile to read and write ORCFile are more towards

Spark on HDP

2014-08-14 Thread Padmanabh
Hi, I was reading the documentation at http://hortonworks.com/labs/spark/ and it seems to say that Spark is not ready for enterprise, which I think is not quite right. What I think they wanted to say is Spark on HDP is not ready for enterprise. I was wondering if someone here is using Spark on

Re: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Joseph Bradley
I have run into that issue too, but only when the data were not pre-processed correctly. E.g., if a categorical feature is binary with values in {-1, +1} instead of {0,1}. Will be very interested to learn if it can occur elsewhere! On Thu, Aug 14, 2014 at 10:16 AM, Sameer Tilak

Re: Spark Akka/actor failures.

2014-08-14 Thread ldmtwo
The reason we are not using MLlib and Breeze is the lack of control over the data and performance. After computing the covariance matrix, there isn't too much we can do after that. Many of the methods are private. For now, we need the max value and the corresponding pair of columns. Later, we may
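
A hedged sketch of that last step using only the public pieces of the era's API: compute the covariance with RowMatrix, then scan the local result via toArray (column-major) for the largest off-diagonal entry. The sample rows are made up and `sc` is assumed to be a live SparkContext.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))

val cov = new RowMatrix(data).computeCovariance()   // local n x n Matrix
val n = cov.numCols
val values = cov.toArray                            // column-major: (i, j) -> values(i + j * n)

val (maxVal, maxI, maxJ) =
  (for (j <- 0 until n; i <- 0 until n if i != j)
    yield (values(i + j * n), i, j)).maxBy(_._1)
println(s"max off-diagonal covariance $maxVal between columns $maxI and $maxJ")
```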

Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-14 Thread Denny Lee
For those who were not able to attend the Seattle Spark Meetup - Spark at eBay - Troubleshooting the Everyday Issues, the slides have now been posted at: http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf. Enjoy! Denny

spark streaming - lambda architecture

2014-08-14 Thread salemi
Hi, How would you implement the batch layer of the lambda architecture with Spark/Spark Streaming? Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html Sent from the Apache Spark User List mailing list

Re: Spark webUI - application details page

2014-08-14 Thread SK
Hi, I am using Spark 1.0.1. But I am still not able to see the stats for completed apps on port 4040 - only for running apps. Is this feature supported or is there a way to log this info to some file? I am interested in stats about the total # of executors, total runtime, and total memory used by

Performance hit for using sc.setCheckpointDir

2014-08-14 Thread Debasish Das
Hi, For our large ALS runs, we are considering using sc.setCheckpointDir so that the intermediate factors are written to HDFS and the lineage is broken... Is there a comparison which shows the performance degradation due to these options? If not, I will be happy to add experiments with it...
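
A minimal sketch of the checkpointing being weighed here; the directory is a placeholder, `sc` is assumed to be a live SparkContext, and the RDD is stand-in data rather than real ALS factors.

```scala
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val factors = sc.parallelize(1 to 1000000).map(i => (i, i * 0.5))
factors.checkpoint()   // marks the RDD; lineage is truncated once it is materialized
factors.count()        // first action computes the RDD and writes the checkpoint to HDFS
```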

Dealing with Idle shells

2014-08-14 Thread Gary Malouf
We have our quantitative team using Spark as part of their daily work. One of the more common problems we run into is that people unintentionally leave their shells open throughout the day. This eats up memory in the cluster and causes others to have limited resources to run their jobs. With

Compiling SNAPTSHOT

2014-08-14 Thread Jim Blomo
Hi, I'm having trouble compiling a snapshot, any advice would be appreciated. I get the error below when compiling either master or branch-1.1. The key error is, I believe, [ERROR] File name too long but I don't understand what it is referring to. Thanks! ./make-distribution.sh --tgz

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
There may be cases where you want to adjust the number of partitions or explicitly call RDD.repartition or RDD.coalesce. However, I would start with the defaults and then adjust if necessary to improve performance (for example, if cores are idling because there aren't enough tasks you may want
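
A hedged illustration of the two adjustments mentioned: repartition performs a full shuffle to raise (or lower) the partition count, while coalesce can shrink it without a shuffle. The numbers are arbitrary and `sc` is assumed to be a live SparkContext.

```scala
val rdd = sc.textFile("hdfs:///data/input")   // partition count follows the HDFS block layout

val widened = rdd.repartition(200)    // shuffle into 200 partitions, e.g. to keep idle cores busy
val narrowed = widened.coalesce(50)   // merge down to 50 partitions without a shuffle

println(s"${rdd.partitions.size} -> ${widened.partitions.size} -> ${narrowed.partitions.size}")
```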

Re: Spark webUI - application details page

2014-08-14 Thread durin
If I don't misunderstand you, setting event logging in SPARK_JAVA_OPTS should achieve what you want. I'm logging to HDFS, but according to the config page http://spark.apache.org/docs/latest/configuration.html a folder should be possible as well. Example with all other settings

Re: Spark webUI - application details page

2014-08-14 Thread Andrew Or
Hi all, As Simon explained, you need to set spark.eventLog.enabled to true. I'd like to add that the usage of SPARK_JAVA_OPTS to set spark configurations is deprecated. I'm sure many of you have noticed this from the scary warning message we print out. :) The recommended and supported way of

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread lancezhange
I finally solved the problem with the following code: var m: org.apache.spark.mllib.classification.LogisticRegressionModel = null m = newModel // newModel is the loaded one, see my post above val labelsAndPredsOnGoodData = goodDataPoints.map { point => val prediction =

Getting hadoop distcp to work on ephemeral-hsfs in spark-ec2 cluster

2014-08-14 Thread Arpan Ghosh
Hi, I have launched an AWS Spark cluster using the spark-ec2 script (--hadoop-major-version=1). The ephemeral-HDFS is setup correctly and I can see the name node at master hostname:50070. When I try to copy files from S3 into ephemeral-HDFS using distcp using the following command:

Re: Spark webUI - application details page

2014-08-14 Thread SK
I set spark.eventLog.enabled to true in $SPARK_HOME/conf/spark-defaults.conf and also configured logging to a file as well as the console in log4j.properties. But I am not able to get the statistics logged to a file. On the console there are a lot of log messages along with the stats - so

Spark working directories

2014-08-14 Thread Yana Kadiyska
Hi all, trying to change defaults of where stuff gets written. I've set -Dspark.local.dir=/spark/tmp and I can see that the setting is used when the executor is started. I do indeed see directories like spark-local-20140815004454-bb3f in this desired location but I also see undesired stuff under

Re: spark streaming - lambda architecture

2014-08-14 Thread Tathagata Das
Can you be a bit more specific about what you mean by lambda architecture? On Thu, Aug 14, 2014 at 2:27 PM, salemi alireza.sal...@udo.edu wrote: Hi, How would you implement the batch layer of the lambda architecture with spark/spark streaming? Thanks, Ali -- View this message in context:

Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Shivaram Venkataraman
Could you try increasing the number of slices with the large data set ? SparkR assumes that each slice (or partition in Spark terminology) can fit in memory of a single machine. Also is the error happening when you do the map function or does it happen when you combine the results ? Thanks

Re: Spark webUI - application details page

2014-08-14 Thread Andrew Or
Hi SK, Not sure if I understand you correctly, but here is how the user normally uses the event logging functionality: After setting spark.eventLog.enabled and optionally spark.eventLog.dir, the user runs his/her Spark application and calls sc.stop() at the end of it. Then he/she goes to the

Re: spark streaming - lambda architecture

2014-08-14 Thread salemi
Below is what I understand by lambda architecture. The batch layer provides the historical data and the speed layer provides the real-time view! All data entering the system is dispatched to both the batch layer and the speed layer for processing. The batch layer has two functions:

Re: Spark working directories

2014-08-14 Thread Calvin
I've had this issue too running Spark 1.0.0 on YARN with HDFS: it defaults to a working directory located in hdfs:///user/$USERNAME and it's not clear how to set the working directory. In the case where HDFS has a non-standard directory structure (i.e., home directories located in hdfs:///users/)

Re: spark streaming - lambda architecture

2014-08-14 Thread Michael Hausenblas
How would you implement the batch layer of the lambda architecture with spark/spark streaming? I assume you're familiar with resources such as https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark and are after more detailed advice? Cheers, Michael --

RE: spark streaming - lambda architecture

2014-08-14 Thread Shao, Saisai
Hi Ali, Maybe you can take a look at Twitter's Summingbird project (https://github.com/twitter/summingbird), which is currently one of the few open-source implementations of the lambda architecture. There's an ongoing sub-project called summingbird-spark, which might be the one you want; maybe this can

None in RDD

2014-08-14 Thread guoxu1231
Hi Guys, I have a serious problem regarding 'None' in RDDs (pyspark). Take an example of a transformation that produces 'None': leftOuterJoin(self, other, numPartitions=None) Perform a left outer join of self and other. For (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all

RE: Spark SQL Stackoverflow error

2014-08-14 Thread Cheng, Hao
I couldn’t reproduce the exception, probably it’s solved in the latest code. From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com] Sent: Thursday, August 14, 2014 11:17 AM To: user@spark.apache.org Subject: Spark SQL Stackoverflow error Hi, I tried running the sample sql code JavaSparkSQL