Re: akka disassociated on GC

2014-07-16 Thread Xiangrui Meng
Hi Makoto, I don't remember whether I wrote that, but thanks for bringing this issue up! There are two important settings to check: 1) driver memory (you can see it from the executor tab), 2) number of partitions (try to use a small number of partitions). I put up two PRs to fix the problem: 1) use broadcast
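
As a rough illustration of the partitioning advice (assuming an existing SparkContext sc; the path and partition counts are made up, not from the thread):

    // Hint the number of input partitions up front, or shrink an existing RDD.
    val data = sc.textFile("hdfs:///path/to/input", 64)  // illustrative path; 64 is only a hint
    val fewer = data.coalesce(16)                        // reduce partitions without a full shuffle
    println(fewer.partitions.length)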

Re: Error when testing with large sparse svm

2014-07-16 Thread Xiangrui Meng
Then it may be a new issue. Do you mind creating a JIRA to track this issue? It would be great if you can help locate the line in BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui On Tue, Jul 15, 2014 at 10:56 PM, crater cq...@ucmerced.edu wrote: I don't really have my code,

Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi all, I am encountering the following error: INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space left on device [duplicate 4] For each slave, df -h looks roughly like this, which makes the above error surprising. Filesystem Size Used Avail Use% Mounted

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote: Hi all, I am encountering the following error: INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No

Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
df -i # on a slave Filesystem Inodes IUsed IFree IUse% Mounted on /dev/xvda1 524288 277701 246587 53% / tmpfs 1917974 1 1917973 1% /dev/shm On Tue, Jul 15, 2014 at 11:39 PM, Xiangrui Meng men...@gmail.com wrote: Check the number of inodes

Re: Error: No space left on device

2014-07-16 Thread Chris Gore
Hi Chris, I've encountered this error when running Spark’s ALS methods too. In my case, it was because I set spark.local.dir improperly, and every time there was a shuffle, it would spill many GB of data onto the local drive. What fixed it was setting it to use the /mnt directory, where a
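
A minimal sketch of the spark.local.dir fix described above; the application name and directory are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point shuffle spill space at the large /mnt volume instead of the small root device.
    val conf = new SparkConf()
      .setAppName("ALSJob")                   // illustrative name
      .set("spark.local.dir", "/mnt/spark")   // must exist and be writable on every worker
    val sc = new SparkContext(conf)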

Re: Ambiguous references to id : what does it mean ?

2014-07-16 Thread Jaonary Rabarisoa
My query is just a simple query that uses the Spark SQL DSL: tagCollection.join(selectedVideos).where('videoId === 'id) On Tue, Jul 15, 2014 at 6:03 PM, Yin Huai huaiyin@gmail.com wrote: Hi Jao, Seems the SQL analyzer cannot resolve the references in the Join condition. What is your

Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Thanks for the quick responses! I used your final -Dspark.local.dir suggestion, but I see this during the initialization of the application: 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at /vol/spark-local-20140716065608-7b2a I would have expected something in

Re: akka disassociated on GC

2014-07-16 Thread Makoto Yui
Hi Xiangrui, (2014/07/16 15:05), Xiangrui Meng wrote: I don't remember whether I wrote that, but thanks for bringing this issue up! There are two important settings to check: 1) driver memory (you can see it from the executor tab), 2) number of partitions (try to use a small number of partitions). I put

Re: How does Spark speculation prevent duplicated work?

2014-07-16 Thread Mingyu Kim
That makes sense. Thanks everyone for the explanations! Mingyu From: Matei Zaharia matei.zaha...@gmail.com Reply-To: user@spark.apache.org user@spark.apache.org Date: Tuesday, July 15, 2014 at 3:00 PM To: user@spark.apache.org user@spark.apache.org Subject: Re: How does Spark speculation

How does Apache Spark handle system failure when deployed in YARN?

2014-07-16 Thread Matthias Kricke
Hello @ the mailing list, We are thinking of using Spark in one of our projects on a Hadoop cluster. During evaluation several questions remain, which are stated below. Preconditions: Let's assume Apache Spark is deployed on a Hadoop cluster using YARN. Furthermore, a Spark execution is running. How

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
Hi Chris, Could you also try `df -i` on the master node? How many blocks/partitions did you set? In the current implementation, ALS doesn't clean the shuffle data because the operations are chained together. But it shouldn't run out of disk space on the MovieLens dataset, which is small.

Re: Kryo deserialisation error

2014-07-16 Thread Hao Wang
Thanks for your reply. The SparkContext is configured as below: sparkConf.setAppName("WikipediaPageRank") sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.kryo.registrator", classOf[PRKryoRegistrator].getName) val inputFile = args(0)
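
For context, a Kryo registrator such as the PRKryoRegistrator referenced above typically looks roughly like this; the class registered here is illustrative, not the thread's actual registrator:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class PRKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // Register the types that actually get shipped between nodes.
        kryo.register(classOf[Array[Double]])
      }
    }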

Re: How does Apache Spark handle system failure when deployed in YARN?

2014-07-16 Thread Sandy Ryza
Hi Matthias, Answers inline. -Sandy On Wed, Jul 16, 2014 at 12:21 AM, Matthias Kricke matthias.kri...@mgm-tp.com wrote: Hello @ the mailing list, We are thinking of using Spark in one of our projects on a Hadoop cluster. During evaluation several questions remain, which are stated below.

Re: How does Apache Spark handle system failure when deployed in YARN?

2014-07-16 Thread Matthias Kricke
Thanks, your answers totally cover all my questions ☺ From: Sandy Ryza [mailto:sandy.r...@cloudera.com] Sent: Wednesday, July 16, 2014 09:41 To: user@spark.apache.org Subject: Re: How does Apache Spark handle system failure when deployed in YARN? Hi Matthias, Answers inline. -Sandy On Wed,

Spark Streaming, external windowing?

2014-07-16 Thread Sargun Dhillon
Does anyone here have a way to do Spark Streaming with external timing for windows? Right now, it relies on the wall clock of the driver to determine the amount of time that each batch read lasts. We have Kafka and HDFS ingress into our Spark Streaming pipeline, where the events are annotated

Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi Xiangrui, Here is the result on the master node: $ df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/xvda1 524288 273997 250291 53% / tmpfs 1917974 1 1917973 1% /dev/shm /dev/xvdv 524288000 30 524287970 1% /vol I

Re: Spark Streaming, external windowing?

2014-07-16 Thread Gerard Maas
Hi Sargun, There have been few discussions on the list recently about the topic. The short answer is that this is not supported at the moment. This is a particularly good thread as it discusses the current state and limitations:

RE: executor-cores vs. num-executors

2014-07-16 Thread innowireless TaeYun Kim
Thanks. Really, now I compare a stage data of the two jobs, ‘core7-exec3’ spends about 12.5 minutes more than ‘core2-exec12’ on GC. From: Nishkam Ravi [mailto:nr...@cloudera.com] Sent: Wednesday, July 16, 2014 5:28 PM To: user@spark.apache.org Subject: Re: executor-cores vs.

Re: Need help on spark Hbase

2014-07-16 Thread Madabhattula Rajesh Kumar
Hi Team, Now I've changed my code and I am reading the configuration from the hbase-site.xml file (this file is in the classpath). When I run this program using: mvn exec:java -Dexec.mainClass=com.cisco.ana.accessavailability.AccessAvailability, it is working fine. But when I run this program from spark-submit

Server IPC version 7 cannot communicate with client version 4 with Spark Streaming 1.0.0 in Java and CDH4 quickstart in local mode

2014-07-16 Thread Juan Rodríguez Hortalá
Hi, I'm running a Java program using Spark Streaming 1.0.0 on Cloudera 4.4.0 quickstart virtual machine, with hadoop-client 2.0.0-mr1-cdh4.4.0, which is the one corresponding to my Hadoop distribution, and that works with other mapreduce programs, and with the maven property

Re: parallel stages?

2014-07-16 Thread Sean Owen
Yes, but what I show can be done in one Spark job. On Wed, Jul 16, 2014 at 5:01 AM, Wei Tan w...@us.ibm.com wrote: Thanks Sean. In Oozie you can use fork-join, however using Oozie to drive Spark jobs, jobs will not be able to share RDD (Am I right? I think multiple jobs submitted by Oozie will

Reading file header in Spark

2014-07-16 Thread Silvina Caíno Lores
Hi everyone! I'm really new to Spark and I'm trying to figure out which would be the proper way to do the following: 1.- Read a file header (a single line) 2.- Build a configuration object from it 3.- Use that object in a function that will be called by map() I thought about using filter()

Re: Reading file header in Spark

2014-07-16 Thread Sean Owen
You can rdd.take(1) to get just the header line. I think someone mentioned before that this is a good use case for having a tail method on RDDs too, to skip the header for subsequent processing. But you can ignore it with a filter, or logic in your map method. On Wed, Jul 16, 2014 at 11:01 AM,
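
A small sketch of the take(1)-plus-filter approach (assuming an existing SparkContext sc; the path is illustrative, and the filter assumes the header text never recurs as a data line):

    val lines = sc.textFile("hdfs:///data/input.csv")
    val header = lines.take(1).head          // read just the header line on the driver
    val body = lines.filter(_ != header)     // drop it from subsequent processing
    // val config = parseHeader(header)      // hypothetical helper that builds the configuration object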

Re: Reading file header in Spark

2014-07-16 Thread Silvina Caíno Lores
Thank you! This is what I needed. I've read it should work the same as the first() method. It's a pity that the taken element cannot be removed from the RDD though. Thanks again! On 16 July 2014 12:09, Sean Owen so...@cloudera.com wrote: You can rdd.take(1) to get just the header line. I

Re: Server IPC version 7 cannot communicate with client version 4 with Spark Streaming 1.0.0 in Java and CDH4 quickstart in local mode

2014-07-16 Thread Sean Owen
Server IPC version 7 cannot communicate with client version 4 means your client is Hadoop 1.x and your cluster is Hadoop 2.x. The default Spark distribution is built for Hadoop 1.x. You would have to make your own build (or, use the artifacts distributed for CDH4.6 maybe? they are certainly built

Read all the columns from a file in spark sql

2014-07-16 Thread pandees waran
Hi, I am a newbie to Spark SQL and I would like to know how to read all the columns from a file in Spark SQL. I have referred to the programming guide here: http://people.apache.org/~tdas/spark-1.0-docs/sql-programming-guide.html The example says: val people =

Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-16 Thread Rohit Pujari
Thanks Matei. On Tue, Jul 15, 2014 at 11:47 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, as mentioned in the FAQ, we are aware of multiple deployments running jobs on over 1000 nodes. Some of our proof of concepts involved people running a 2000-node job on EC2. I wouldn't confuse

Re: count vs countByValue in for/yield

2014-07-16 Thread Ognen Duzlevski
Hello all, Can anyone offer any insight on the below? Both are legal Spark but the first one works, the latter one does not. They both work on a local machine but in a standalone cluster the one with countByValue fails. Thanks! Ognen On 7/15/14, 2:23 PM, Ognen Duzlevski wrote: Hello, I

Problem running Spark shell (1.0.0) on EMR

2014-07-16 Thread Ian Wilkinson
Hi, I’m trying to run the Spark (1.0.0) shell on EMR and encountering a classpath issue. I suspect I’m missing something gloriously obvious, but so far it is eluding me. I launch the EMR Cluster (using the aws cli) with: aws emr create-cluster --name Test Cluster \ --ami-version

Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Hi All, I'm trying to do a simple record matching between 2 files and wrote following code - *import org.apache.spark.sql.SQLContext;* *import org.apache.spark.rdd.RDD* *object SqlTest {* * case class Test(fld1:String, fld2:String, fld3:String, fld4:String, fld4:String, fld5:Double,
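
For reference, a hedged sketch of this kind of two-file match against the Spark 1.0-era SQL API; the field names and join key are illustrative, not the schema from the thread:

    import org.apache.spark.sql.SQLContext

    case class Rec(key: String, value: String)   // defined at the top level, outside main

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD            // implicit RDD -> SchemaRDD conversion

    val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
      .map(_.split(",")).map(a => Rec(a(0), a(1)))
    val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
      .map(_.split(",")).map(a => Rec(a(0), a(1)))

    file1.registerAsTable("t1")
    file2.registerAsTable("t2")

    val matched = sqlContext.sql(
      "SELECT t1.key, t1.value FROM t1 JOIN t2 ON t1.key = t2.key")
    matched.collect().foreach(println)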

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Check your executor logs for the output or if your data is not big collect it in the driver and print it. On Jul 16, 2014, at 9:21 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi All, I'm trying to do a simple record matching between 2 files and wrote following

Re: Re: how to construct a ClassTag object as a method parameter in Java

2014-07-16 Thread balvisio
Hi, I think the same issue is happening with the constructor of the PartitionPruningRDD class. It hasn't been fixed in version 1.0.1. Should this be reported to JIRA? -- View this message in context:

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Hi Soumya, Data is very small, 500+ lines in each file. Removed the last 2 lines and placed this at the end: matched.collect().foreach(println);. Still no luck. It's been more than 5 min, the execution is still running. Checked logs, nothing in stdout. In stderr I don't see anything going wrong, all

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
When you submit your job, it should appear on the Spark UI. Same with the REPL. Make sure your job is submitted to the cluster properly. On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi Soumya, Data is very small, 500+ lines in each file.

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Yes it is appearing on the Spark UI, and remains there with state as RUNNING till I press Ctrl+C in the terminal to kill the execution. Barring the statements to create the spark context, if I copy-paste the lines of my code into the spark shell, it runs perfectly, giving the desired output. ~Sarath On

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Can you try submitting a very simple job to the cluster. On Jul 16, 2014, at 10:25 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Yes it is appearing on the Spark UI, and remains there with state as RUNNING till I press Ctrl+C in the terminal to kill the execution.

Re: Read all the columns from a file in spark sql

2014-07-16 Thread Michael Armbrust
I think what you might be looking for is the ability to programmatically specify the schema, which is coming in 1.1. Here's the JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Wed, Jul 16, 2014 at 8:24 AM, pandees waran pande...@gmail.com wrote: Hi, I am newbie to spark

Re: Ambiguous references to id : what does it mean ?

2014-07-16 Thread Michael Armbrust
Yes, but if both tagCollection and selectedVideos have a column named id then Spark SQL does not know which one you are referring to in the where clause. Here's an example with aliases: val x = testData2.as('x) val y = testData2.as('y) val join = x.join(y, Inner, Some(x.a.attr ===
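
A hedged sketch applying the same aliasing pattern to the original videoId/id join; it assumes the Spark SQL DSL implicits (import sqlContext._) and the Inner join type are in scope, as in the testData2 example:

    // Alias each side so the duplicated column name can be qualified in the join condition.
    val tags   = tagCollection.as('tags)
    val videos = selectedVideos.as('videos)
    val joined = tags.join(videos, Inner, Some("tags.videoId".attr === "videos.id".attr))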

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Yes Soumya, I did it. First I tried with the example available in the documentation (example using people table and finding teenagers). After successfully running it, I moved on to this one which is starting point to a bigger requirement for which I'm evaluating Spark SQL. On Wed, Jul 16, 2014

Re: Simple record matching using Spark SQL

2014-07-16 Thread Michael Armbrust
What if you just run something like: sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count() On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Yes Soumya, I did it. First I tried with the example available in the documentation

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Hi Michael, Tried it. It's correctly printing the line counts of both the files. Here's what I tried - Code: package test object Test4 { case class Test(fld1: String, fld2: String, fld3: String, fld4: String, fld5: String, fld6: Double, fld7:

Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi Xiangrui, I accidentally did not send df -i for the master node. Here it is at the moment of failure: Filesystem Inodes IUsed IFree IUse% Mounted on /dev/xvda1 524288 280938 243350 54% / tmpfs 3845409 1 3845408 1% /dev/shm /dev/xvdb

Re: Spark Streaming Json file groupby function

2014-07-16 Thread srinivas
Hi TD, I defined the case class outside the main method and was able to compile the code successfully. But I am getting a runtime error when trying to process some json file from kafka. Here is the code I compiled: import java.util.Properties import kafka.producer._ import

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Ben Horner
Should I take it from the lack of replies that the --ebs-vol-size feature doesn't work? -Ben -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-with-spark-ec2-script-ebs-vol-size-tp9619p9934.html Sent from the Apache Spark User List mailing list

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Ben Horner
please add From: Ben Horner [via Apache Spark User List] ml-node+s1001560n9934...@n3.nabble.com Date: Wednesday, July 16, 2014 at 8:47 AM To: Ben Horner ben.hor...@atigeo.com Subject: Re: Trouble with spark-ec2 script:

Gradient Boosting Decision Trees

2014-07-16 Thread Pedro Silva
Hi there, I am looking for a GBM MLlib implementation. Does anyone know if there is a plan to roll it out soon? Thanks! Pedro

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Tom
Hi Burak, Thank you for your pointer, it is really helping out. I do have some consecutive questions though. After looking at the Big Data Benchmark page https://amplab.cs.berkeley.edu/benchmark/ (Section Run this benchmark yourself), I was expecting the following combination of files: Sets:

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Chris DuBois
Hi Ben, It worked for me, but only when using the default region. Using --region=us-west-2 resulted in errors about security groups. Chris On Wed, Jul 16, 2014 at 8:53 AM, Ben Horner ben.hor...@atigeo.com wrote: please add From: Ben Horner [via Apache Spark User List] [hidden email]

Re: Terminal freeze during SVM

2014-07-16 Thread AlexanderRiggers
so I need to reconfigure my sparkcontext this way: val conf = new SparkConf() .setMaster("local") .setAppName("CountingSheep") .set("spark.executor.memory", "1g") .set("spark.akka.frameSize", "20") val sc = new SparkContext(conf) And start a new cluster

Re: Gradient Boosting Decision Trees

2014-07-16 Thread Ameet Talwalkar
Hi Pedro, Yes, although they will probably not be included in the next release (since the code freeze is ~2 weeks away), GBM (and other ensembles of decision trees) are currently under active development. We're hoping they'll make it into the subsequent release. -Ameet On Wed, Jul 16, 2014 at

running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
Hello community, tried to run storm app on yarn, using cloudera hadoop and spark distro (from http://archive.cloudera.com/cdh5/cdh/5) hadoop version: hadoop-2.3.0-cdh5.0.3.tar.gz spark version: spark-0.9.0-cdh5.0.3.tar.gz DEFAULT_YARN_APPLICATION_CLASSPATH is part of hadoop-api-yarn jar ...

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Yin Huai
Hi Srinivas, Seems the query you used is val results = sqlContext.sql("select type from table1"). However, table1 does not have a field called type. The schema of table1 is defined as the class definition of your case class Record (i.e. ID, name, score, and school are fields of your table1). Can you

Re: Gradient Boosting Decision Trees

2014-07-16 Thread Pedro Silva
Hi Ameet, that's great news! Thanks, Pedro On Wed, Jul 16, 2014 at 9:33 AM, Ameet Talwalkar atalwal...@gmail.com wrote: Hi Pedro, Yes, although they will probably not be included in the next release (since the code freeze is ~2 weeks away), GBM (and other ensembles of decision trees) are

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Sean Owen
Somewhere in here, you are not actually running vs Hadoop 2 binaries. Your cluster is certainly Hadoop 2, but your client is not using the Hadoop libs you think it is (or your compiled binary is linking against Hadoop 1, which is the default for Spark -- did you change it?) On Wed, Jul 16, 2014

using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Hi, My application has multiple dstreams on the same inputstream: dstream1 // 1 second window dstream2 // 2 second window dstream3 // 5 minute window I want to write logic that deals with all three windows (e.g. when the 1 second window differs from the 2 second window by some delta ...) I've

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Sandy Ryza
Andrew, Are you running on a CM-managed cluster? I just checked, and there is a bug here (fixed in 1.0), but it's avoided by having yarn.application.classpath defined in your yarn-site.xml. -Sandy On Wed, Jul 16, 2014 at 10:02 AM, Sean Owen so...@cloudera.com wrote: Somewhere in here, you

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Luis Ángel Vicente Sánchez
I'm joining several kafka dstreams using the join operation, but you have the limitation that the duration of the batch has to be the same, i.e. a 1 second window for all dstreams... so it would not work for you. 2014-07-16 18:08 GMT+01:00 Walrus theCat walrusthe...@gmail.com: Hi, My application has

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Sean Owen
OK, if you're sure your binary has Hadoop 2 and/or your classpath has Hadoop 2, that's not it. I'd look at Sandy's suggestion then. On Wed, Jul 16, 2014 at 6:11 PM, Andrew Milkowski amgm2...@gmail.com wrote: thanks Sean! so what I did is in project/SparkBuild.scala I made it compile with

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
thanks Sandy, no CM-managed cluster, straight from the cloudera tar ( http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.3.tar.gz) trying your suggestion immediately! thanks so much for taking the time.. On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Andrew, Are

Re: Number of executors change during job running

2014-07-16 Thread Bill Jay
Hi Tathagata, I have tried the repartition method. The reduce stage first had 2 executors and then it had around 85 executors. I specified repartition(300) and each of the executors were specified 2 cores when I submitted the job. This shows repartition works to increase more executors. However,

ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Svend
Hi all, I just installed a mesos 0.19 cluster. I am failing to execute basic SparkQL operations on text files with Spark 1.0.1 with the spark-shell. I have one Mesos master without zookeeper and 4 mesos slaves. All nodes are running JDK 1.7.51 and Scala 2.10.4. The spark package is

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Yeah -- I tried the .union operation and it didn't work for that reason. Surely there has to be a way to do this, as I imagine this is a commonly desired goal in streaming applications? On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: I'm joining

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Luis Ángel Vicente Sánchez
hum... maybe consuming all streams at the same time with an actor that would act as a new DStream source... but this is just a random idea... I don't really know if that would be a good idea or even possible. 2014-07-16 18:30 GMT+01:00 Walrus theCat walrusthe...@gmail.com: Yeah -- I tried the

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Or, if not, is there a way to do this in terms of a single dstream? Keep in mind that dstream1, dstream2, and dstream3 have already had transformations applied. I tried creating the dstreams by calling .window on the first one, but that ends up with me having ... 3 dstreams... which is the same

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
hey at least it's something (thanks!) ... not sure what i'm going to do if i can't find a solution (other than not use spark) as i really need these capabilities. anyone got anything else? On Wed, Jul 16, 2014 at 10:34 AM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: hum...

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
Sandy, perfect! You saved me tons of time! I added this in yarn-site.xml and the job ran to completion. Can you do me (us) a favor and push the newest and patched spark/hadoop to cdh5 (tars) if possible, and thanks again for this (huge time saver) On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
For others, to solve the topic problem: in yarn-site.xml add: &lt;property&gt; &lt;name&gt;yarn.application.classpath&lt;/name&gt; &lt;value&gt;$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,

Re: Need help on spark Hbase

2014-07-16 Thread Jerry Lam
Hi Rajesh, I saw : Warning: Local jar /home/rajesh/hbase-0.96.1.1-hadoop2/lib/hbase -client-0.96.1.1-hadoop2.jar, does not exist, skipping. in your log. I believe this jar contains the HBaseConfiguration. I'm not sure what went wrong in your case but can you try without spaces in --jars i.e.

Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread hsy...@gmail.com
When I'm reading the API of spark streaming, I'm confused by the 3 different durations StreamingContext(conf: SparkConf http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html , batchDuration: Duration

RE: executor-cores vs. num-executors

2014-07-16 Thread Wei Tan
Thanks for sharing your experience. I got the same experience -- multiple moderate JVMs beat a single huge JVM. Besides the minor JVM starting overhead, is it always better to have multiple JVMs rather than a single one? Best regards, Wei - Wei Tan, PhD

Re: Kryo deserialisation error

2014-07-16 Thread Tathagata Das
Is the class that is not found in the wikipediapagerank jar? TD On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wh.s...@gmail.com wrote: Thanks for your reply. The SparkContext is configured as below: sparkConf.setAppName("WikipediaPageRank") sparkConf.set("spark.serializer",

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Burak Yavuz
Hi Tom, Actually I was mistaken, sorry about that. Indeed on the website, the keys for the datasets you mention are not showing up. However, they are still accessible through the spark-shell, which means that they are there. So in order to answer your questions: - Are the tiny and 1node sets

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Michael Armbrust
Mostly true. The execution of two equivalent logical plans will be exactly the same, independent of the dialect. Resolution can be slightly different as SQLContext defaults to case sensitive and HiveContext defaults to case insensitive. One other very technical detail: The actual planning done

Re: SPARK_WORKER_PORT (standalone cluster)

2014-07-16 Thread jay vyas
Now I see the answer to this. Spark slaves are started on random ports and tell the master where they are; then the master acknowledges them. (worker logs) Starting Spark worker :43282 (master logs) Registering worker on :43282 with 8 cores, 16.5 GB RAM Thus, the port is random because

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread aaronjosephs
The only other thing to keep in mind is that window duration and slide duration have to be multiples of batch duration, IDK if you made that fully clear -- View this message in context:
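
A tiny sketch of that constraint: with a 5-second batch duration, both the window and slide durations must be multiples of 5 seconds (the source and durations are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowDemo")   // illustrative
    val ssc = new StreamingContext(conf, Seconds(5))      // batchDuration = 5s
    val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source
    // windowDuration = 30s and slideDuration = 10s are both multiples of the 5s batch.
    val windowed = lines.window(Seconds(30), Seconds(10))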

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-16 Thread Matt Work Coarr
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains what's causing this to break. One interesting point that we just discovered is that if we run the driver and the slave (worker) on the same host it runs, but if we run the driver on a separate host it does not run.

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Note that runnning a simple map+reduce job on the same hdfs files with the same installation works fine: Did you call collect() on the totalLength? Otherwise nothing has actually executed.

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Oh, I'm sorry... reduce is also an operation On Wed, Jul 16, 2014 at 3:37 PM, Michael Armbrust mich...@databricks.com wrote: Note that runnning a simple map+reduce job on the same hdfs files with the same installation works fine: Did you call collect() on the totalLength? Otherwise

SaveAsTextFile of RDD taking much time

2014-07-16 Thread sudiprc
Hi All, I am new to Spark. I have written a program to read data from a big local file, sort it using Spark SQL and then filter it based on some validation rules. I have tested this program with a file of 23860746 lines, and it took 39 secs (2 cores and Xmx as 6gb). But when I try serializing it to a local file,

Re: Spark Streaming, external windowing?

2014-07-16 Thread Tathagata Das
One way to do that is currently possible is given here http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAMwrk0=b38dewysliwyc6hmze8tty8innbw6ixatnd1ue2-...@mail.gmail.com%3E On Wed, Jul 16, 2014 at 1:16 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi Sargun, There have

Re: can't print DStream after reduce

2014-07-16 Thread Tathagata Das
Yeah. I have been wondering how to check this in the general case, across all deployment modes, but that's a hard problem. Last week I realized that even if we can do it just for local, we can get the biggest bang for the buck. TD On Tue, Jul 15, 2014 at 9:31 PM, Tobias Pfeiffer t...@preferred.jp

Re: Multiple streams at the same time

2014-07-16 Thread Tathagata Das
I hope it all works :) On Wed, Jul 16, 2014 at 9:08 AM, gorenuru goren...@gmail.com wrote: Hi and thank you for your reply. Looks like it's possible. It looks like a hack for me because we are specifying batch duration when creating context. This means that if we will specify batch

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Tathagata Das
I think I know what the problem is. Spark Streaming is constantly doing garbage cleanup, throwing away data that it does not need based on the operations in the DStream. Here the DStream operations are not aware of the Spark SQL queries that are happening asynchronously to Spark Streaming. So data is
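
The preview cuts off before the fix; one common mitigation for this situation (assuming a StreamingContext named ssc, and not necessarily the exact advice given later in the thread) is to ask Spark Streaming to retain generated RDDs longer:

    import org.apache.spark.streaming.Minutes

    // Keep generated RDDs for at least 5 minutes so work running outside the DStream
    // operations (here, the asynchronous Spark SQL queries) can still reach them.
    ssc.remember(Minutes(5))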

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Svend
Hi Michael, Thanks for your reply. Yes, the reduce triggered the actual execution, I got a total length (totalLength: 95068762, for the record). -- View this message in context:

Release date for new pyspark

2014-07-16 Thread Paul Wais
Dear List, The version of pyspark on master has a lot of nice new features, e.g. SequenceFile reading, pickle i/o, etc: https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353 I downloaded the recent 1.0.1 release and was surprised to see the distribution did not include these

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Hmmm, it could be some weirdness with classloaders / Mesos / spark sql? I'm curious if you would hit an error if there were no lambda functions involved. Perhaps if you load the data using jsonFile or parquetFile. Either way, I'd file a JIRA. Thanks! On Jul 16, 2014 6:48 PM, Svend

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Sandy Ryza
Hi Ron, I just checked and this bug is fixed in recent releases of Spark. -Sandy On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen ches...@alpinenow.com wrote: Ron, Which distribution and Version of Hadoop are you using ? I just looked at CDH5 ( hadoop-mapreduce-client-core-

Re: Release date for new pyspark

2014-07-16 Thread Mark Hamstra
You should expect master to compile and run: patches aren't merged unless they build and pass tests on Jenkins. You shouldn't expect new features to be added to stable code in maintenance releases (e.g. 1.0.1). AFAIK, we're still on track with Spark 1.1.0 development, which means that it should

Re: Cassandra driver Spark question

2014-07-16 Thread RodrigoB
Thanks to both for the comments and the debugging suggestion, I will try to use it. Regarding your comment, yes I do agree the current solution was not efficient, but for using the saveToCassandra method I need an RDD, thus the parallelize method. I finally got directed by Piotr to use the

Re: Memory compute-intensive tasks

2014-07-16 Thread rpandya
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too few SNAP tasks - only 2 or 3 when I had specified 46. The job failed, perhaps for unrelated reasons, with some odd exceptions in the log (at the end of this message). But I really don't want to force data movement

Re: Release date for new pyspark

2014-07-16 Thread Matei Zaharia
Yeah, we try to have a regular 3 month release cycle; see https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the current window. Matei On Jul 16, 2014, at 4:21 PM, Mark Hamstra m...@clearstorydata.com wrote: You should expect master to compile and run: patches aren't merged

Re: Memory compute-intensive tasks

2014-07-16 Thread Liquan Pei
Hi Ravi, I have seen a similar issue before. You can try to set fs.hdfs.impl.disable.cache to true in your hadoop configuration. For example, suppose your hadoop configuration object is hadoopConf, you can use hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true) Let me know if that helps.
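
A minimal sketch of that suggestion; where the Configuration instance comes from depends on the job, so the setup here is illustrative:

    import org.apache.hadoop.conf.Configuration

    val hadoopConf = new Configuration()
    // Disable the shared FileSystem cache so each task builds its own HDFS client.
    hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)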

Spark Streaming timestamps

2014-07-16 Thread Bill Jay
Hi all, I am currently using Spark Streaming to conduct a real-time data analytics. We receive data from Kafka. We want to generate output files that contain results that are based on the data we receive from a specific time interval. I have several questions on Spark Streaming's timestamp: 1)

spark-ec2 script with Tachyon

2014-07-16 Thread nit
Hi, It seems that spark-ec2 script deploys Tachyon module along with other setup. I am trying to use .persist(OFF_HEAP) for RDD persistence, but on worker I see this error -- Failed to connect (2) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused -- From

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Tathagata Das
Have you taken a look at DStream.transformWith( ... )? That allows you to apply arbitrary transformations between RDDs (of the same timestamp) of two different streams. So you can do something like this. 2s-window-stream.transformWith(1s-window-stream, (rdd1: RDD[...], rdd2: RDD[...]) => { ... //
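
A hedged sketch of the transformWith pattern, pairing the 1-second and 2-second window streams; the stream names, element types and the join body are all illustrative:

    import org.apache.spark.rdd.RDD

    // Combine RDDs with the same batch time from two differently-windowed DStreams
    // (both derived from the same StreamingContext).
    val combined = twoSecWindowStream.transformWith(oneSecWindowStream,
      (rdd2s: RDD[(String, Int)], rdd1s: RDD[(String, Int)]) => {
        rdd2s.join(rdd1s).filter { case (_, (c2, c1)) => c2 != c1 }  // e.g. keep keys whose counts differ
      })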

Use Spark with HBase' HFileOutputFormat

2014-07-16 Thread Jianshi Huang
Hi, I want to use Spark with HBase and I'm confused about how to ingest my data using HBase' HFileOutputFormat. It recommends calling configureIncrementalLoad which does the following: - Inspects the table to configure a total order partitioner - Uploads the partitions file to the cluster

Re: Spark Streaming timestamps

2014-07-16 Thread Tathagata Das
Answers inline. On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently using Spark Streaming to conduct a real-time data analytics. We receive data from Kafka. We want to generate output files that contain results that are based on the data we

Kmeans

2014-07-16 Thread amin mohebbi
Can anyone explain to me what is the difference between kmeans in MLlib and kmeans in examples/src/main/python/kmeans.py? Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at University of Malaysia H/P : +60 18

Re: Release date for new pyspark

2014-07-16 Thread Michael Armbrust
You should try cleaning and then building. We have recently hit a bug in the scala compiler that sometimes causes non-clean builds to fail. On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, we try to have a regular 3 month release cycle; see
