Re: Facing error: java.lang.ArrayIndexOutOfBoundsException while executing SparkSQL join query

2015-02-28 Thread anamika gupta
The issue is now resolved. One of the CSV files had an incorrect record at the end. On Fri, Feb 27, 2015 at 4:24 PM, anamika gupta anamika.guo...@gmail.com wrote: I have three tables with the following schema: case class date_d(WID: Int, CALENDAR_DATE: java.sql.Timestamp, DATE_STRING:
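
A minimal sketch of the kind of defensive parsing that avoids this failure mode, assuming a three-column CSV matching the date_d schema above (the file path, date format, and names are illustrative):

    import java.sql.Timestamp
    import java.text.SimpleDateFormat

    case class DateD(wid: Int, calendarDate: Timestamp, dateString: String)

    // Assumes an existing SparkContext `sc`. Dropping rows with the wrong
    // field count keeps one corrupt trailing record from failing the join.
    val rows = sc.textFile("hdfs:///data/date_d.csv")
      .map(_.split(","))
      .filter(_.length == 3) // discard malformed records
      .map { f =>
        val ts = new Timestamp(
          new SimpleDateFormat("yyyy-MM-dd").parse(f(1).trim).getTime)
        DateD(f(0).trim.toInt, ts, f(2).trim)
      }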

SparkSQL production readiness

2015-02-28 Thread Ashish Mukherjee
Hi, I am exploring SparkSQL for my purposes of performing large relational operations across a cluster. However, it seems to be in alpha right now. Is there any indication when it would be considered production-level? I don't see any info on the site. Regards, Ashish

RE: SparkSQL production readiness

2015-02-28 Thread Wang, Daoyuan
Hopefully the alpha tag will be removed in 1.4.0, if the community can review code a little bit faster :P Thanks, Daoyuan From: Ashish Mukherjee [mailto:ashish.mukher...@gmail.com] Sent: Saturday, February 28, 2015 4:28 PM To: user@spark.apache.org Subject: SparkSQL production readiness Hi, I

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-28 Thread Pat Ferrel
Maybe, but any time the workaround is to use spark-submit --conf spark.executor.extraClassPath=/guava.jar blah, that means that standalone apps must have hard-coded paths that are honored on every worker. And as you know, a lib is pretty much blocked from use of this version of Spark—hence the

Running in-memory SQL on streamed relational data

2015-02-28 Thread Ashish Mukherjee
Hi, I have been looking at Spark Streaming, which seems to be for the use case of live streams which are processed one line at a time, generally in real time. Since SparkSQL reads data from some filesystem, I was wondering if there is something which connects SparkSQL with Spark Streaming, so I

Re: Running in-memory SQL on streamed relational data

2015-02-28 Thread Akhil Das
I think you can do simple operations like foreachRDD or transform to get access to the RDDs in the stream and then you can do SparkSQL over it. Thanks Best Regards On Sat, Feb 28, 2015 at 3:27 PM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: Hi, I have been looking at Spark Streaming
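A hedged sketch of the foreachRDD pattern Akhil describes, using the Spark 1.2-era SQLContext API (the socket source, Event schema, and query are illustrative):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Event(user: String, amount: Double)

    // Assumes an existing SparkContext `sc`.
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      val sqlContext = new SQLContext(rdd.sparkContext)
      import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD (Spark 1.2)
      val events = rdd.map(_.split(","))
        .filter(_.length == 2)
        .map(f => Event(f(0), f(1).toDouble))
      events.registerTempTable("events")
      sqlContext.sql("SELECT user, SUM(amount) FROM events GROUP BY user")
        .collect().foreach(println)
    }
    ssc.start()
    ssc.awaitTermination()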

SORT BY and ORDER BY file size v/s RAM size

2015-02-28 Thread DEVAN M.S.
Hi devs, Is there any connection between the input file size and RAM size for sorting using SparkSQL? I tried a 1 GB file with 8 GB RAM and 4 cores and got java.lang.OutOfMemoryError: GC overhead limit exceeded. Or could it be for any other reason? It's working for other SparkSQL

Reg. Difference in Performance

2015-02-28 Thread Deep Pradhan
Hi, I am running Spark applications in GCE. I set up clusters with different numbers of nodes, varying from 1 to 7. The machines are single-core machines. I set spark.default.parallelism to the number of nodes in the cluster for each cluster. I ran the four applications available in Spark

bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Koert Kuipers
Hey, running my first map-reduce-like (meaning disk-to-disk, avoiding in-memory RDDs) computation in Spark on YARN, I immediately got bitten by a too-low spark.yarn.executor.memoryOverhead. However, it took me about an hour to find out this was the cause. At first I observed failing shuffles leading to
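
For reference, the overhead can be raised explicitly; a sketch, with an illustrative value in MB (the right number depends on the workload):

    import org.apache.spark.SparkConf

    // Equivalent to passing
    //   --conf spark.yarn.executor.memoryOverhead=1024
    // to spark-submit. This is off-heap headroom that YARN adds to each
    // executor container on top of spark.executor.memory.
    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job")
      .set("spark.yarn.executor.memoryOverhead", "1024")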

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Ted Yu
I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
Thanks for taking this on Ted! On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote: I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default as

Re: SparkSQL production readiness

2015-02-28 Thread Michael Armbrust
We are planning to remove the alpha tag in 1.3.0. On Sat, Feb 28, 2015 at 12:30 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hopefully the alpha tag will be removed in 1.4.0, if the community can review code a little bit faster :P Thanks, Daoyuan From: Ashish Mukherjee

Re: Tools to manage workflows on Spark

2015-02-28 Thread Qiang Cao
Thanks, Ashish! Is Oozie integrated with Spark? I know it can accommodate some Hadoop jobs. On Sat, Feb 28, 2015 at 6:07 PM, Ashish Nigam ashnigamt...@gmail.com wrote: Qiang, Did you look at Oozie? We use oozie to run spark jobs in production. On Feb 28, 2015, at 2:45 PM, Qiang Cao

Re: Reg. Difference in Performance

2015-02-28 Thread Joseph Bradley
Hi Deep, Compute times may not be very meaningful for small examples like those. If you increase the sizes of the examples, then you may start to observe more meaningful trends and speedups. Joseph On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am

Re: Tools to manage workflows on Spark

2015-02-28 Thread Ashish Nigam
You have to call spark-submit from oozie. I used this link to get the idea for my implementation - http://mail-archives.apache.org/mod_mbox/oozie-user/201404.mbox/%3CCAHCsPn-0Grq1rSXrAZu35yy_i4T=fvovdox2ugpcuhkwmjp...@mail.gmail.com%3E On Feb 28, 2015, at 3:25 PM, Qiang Cao

How to debug a hung spark application

2015-02-28 Thread manasdebashiskar
Hi, I have a spark application that hangs on just one task (the rest of the 200-300 tasks complete in reasonable time). I can see in the thread dump which function gets stuck; however, I don't have a clue as to what value is causing that behaviour. Also, logging the inputs before the function is

Re: How to debug a Hung task

2015-02-28 Thread Michael Albert
For what it's worth, I was seeing mysterious hangs, but they went away when upgrading from Spark 1.2 to 1.2.1. I don't know if this is your problem. Also, I'm using AWS EMR images, which were also upgraded. Anyway, that's my experience. -Mike From: Manas Kar manasdebashis...@gmail.com To:

Re: Getting to proto buff classes in Spark Context

2015-02-28 Thread John Meehan
Maybe try including the jar with --driver-class-path jar On Feb 26, 2015, at 12:16 PM, Akshat Aranya aara...@gmail.com wrote: My guess would be that you are packaging too many things in your job, which is causing problems with the classpath. When your jar goes in first, you get the

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
A correction to my first post: There is also a repartition right before groupByKey to help avoid too-many-open-files error: rdd2.union(rdd1).map(...).filter(...).repartition(15000).groupByKey().map(...).flatMap(...).saveAsTextFile() On Sat, Feb 28, 2015 at 11:10 AM, Arun Luthra
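
The pipeline shape being described, as a sketch with toy stand-ins for the real parse/filter/summary logic (assumes an existing SparkContext `sc`):

    val rdd1 = sc.textFile("hdfs:///in1")
    val rdd2 = sc.textFile("hdfs:///in2")

    rdd2.union(rdd1)
      .map(line => (line.take(8), line)) // toy key extraction
      .filter { case (_, v) => v.nonEmpty }
      .repartition(15000) // bounds the number of files each task holds open
      .groupByKey()
      .map { case (k, vs) => s"$k\t${vs.size}" }
      .saveAsTextFile("hdfs:///out")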

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-28 Thread Erlend Hamnaberg
Yes. I ran into this problem with a Mahout snapshot and Spark 1.2.0. I didn't really try to figure out why that was a problem, since there were already too many moving parts in my app. Obviously there is a classpath issue somewhere. /Erlend On 27 Feb 2015 22:30, Pat Ferrel p...@occamsmachete.com

Tools to manage workflows on Spark

2015-02-28 Thread Qiang Cao
Hi Everyone, We need to deal with workflows on Spark. In our scenario, each workflow consists of multiple processing steps. Among different steps, there could be dependencies. I'm wondering if there are tools available that can help us schedule and manage workflows on Spark. I'm looking for

Re: Tools to manage workflows on Spark

2015-02-28 Thread Ashish Nigam
Qiang, Did you look at Oozie? We use oozie to run spark jobs in production. On Feb 28, 2015, at 2:45 PM, Qiang Cao caoqiang...@gmail.com wrote: Hi Everyone, We need to deal with workflows on Spark. In our scenario, each workflow consists of multiple processing steps. Among different

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Ashish Nigam
Ted, spark-catalyst_2.11-1.2.1.jar is present in the classpath. BTW, I am running the code locally in my Eclipse workspace. Here's the complete exception stack trace: Exception in thread main scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-28 Thread Joseph Bradley
Hi Shahab, There are actually a few distributed Matrix types which support sparse representations: RowMatrix, IndexedRowMatrix, and CoordinateMatrix. The documentation has a bit more info about the various uses: http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix The
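A short sketch of the sparse types mentioned above (assumes an existing SparkContext `sc`; the dimensions and entries are illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}

    // A local sparse vector: length 5, non-zeros at indices 1 and 3.
    val sv = Vectors.sparse(5, Array(1, 3), Array(2.0, 7.0))

    // A distributed sparse matrix built from (i, j, value) entries.
    val entries = sc.parallelize(Seq(MatrixEntry(0, 1, 2.0), MatrixEntry(3, 4, 7.0)))
    val coordMat = new CoordinateMatrix(entries)

    // Convert to a row-oriented form when an algorithm expects one.
    val rowMat: RowMatrix = coordMat.toRowMatrix()
    println(s"${rowMat.numRows()} x ${rowMat.numCols()}")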

Re: Some questions after playing a little with the new ml.Pipeline.

2015-02-28 Thread Joseph Bradley
Hi Jao, You can use external tools and libraries if they can be called from your Spark program or script (with appropriate conversion of data types, etc.). The best way to apply a pre-trained model to a dataset would be to call the model from within a closure, e.g.: myRDD.map { myDatum =
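A sketch of the closure pattern Joseph outlines, with a stand-in model (MyModel and its predict method are hypothetical; assumes an existing SparkContext `sc`):

    // Any serializable object with a predict method works here.
    case class MyModel(w: Array[Double]) {
      def predict(x: Array[Double]): Double =
        w.zip(x).map { case (a, b) => a * b }.sum
    }

    val model = MyModel(Array(0.5, -1.2, 3.0))
    val bcModel = sc.broadcast(model) // shipped once per executor, not per task

    val data = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(0.0, 1.0, 0.0)))
    val predictions = data.map(x => bcModel.value.predict(x))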

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Sean Owen
There was a recent discussion about whether to increase or indeed make configurable this kind of default fraction. I believe the suggestion there too was that 9-10% is a safer default. Advanced users can lower the resulting overhead value; it may still have to be increased in some cases, but a

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Ted Yu
Having good out-of-box experience is desirable. +1 on increasing the default. On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote: There was a recent discussion about whether to increase or indeed make configurable this kind of default fraction. I believe the suggestion

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Aaron Davidson
All stated symptoms are consistent with GC pressure (other nodes timeout trying to connect because of a long stop-the-world), quite possibly due to groupByKey. groupByKey is a very expensive operation as it may bring all the data for a particular partition into memory (in particular, it cannot

Re: Scheduler hang?

2015-02-28 Thread Victor Tso-Guillen
Moving user to bcc. What I found was that the TaskSetManager for my task set that had 5 tasks had preferred locations set for 4 of the 5. Three had localhost/driver and had completed. The one that had nothing had also completed. The last one was set by our code to be my IP address. Local mode can

Re: Failed to parse Hive query

2015-02-28 Thread Anusha Shamanur
Hi, I reconfigured everything. Still facing the same issue. Can someone please help? On Friday, February 27, 2015, Anusha Shamanur anushas...@gmail.com wrote: I do. What tags should I change in this? I changed the value of hive.exec.scratchdir to /tmp/hive. What else? On Fri, Feb 27, 2015

Scalable JDBCRDD

2015-02-28 Thread Michal Klos
Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not a substitute (SparkSQL, Hive, etc.) because of a couple of factors, including custom functions used in the
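
One possible starting point (a sketch, not necessarily what the poster ends up with) is Spark's built-in JdbcRDD, which splits a range-bounded query across partitions; the URL, credentials, query, and bounds below are illustrative (assumes an existing SparkContext `sc`):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:postgresql://dbhost/prod", "user", "pass"),
      // The query must contain exactly two '?' placeholders for the bounds.
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
      1L, 10000000L, // lower/upper bound of the split column
      20,            // number of partitions, each scanning a sub-range
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))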

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Paweł Szulc
I would first check whether there is any possibility that after doing groupByKey one of the groups does not fit in one of the executors' memory. To back up my theory, instead of doing groupByKey + map, try reduceByKey + mapValues. Let me know if that helped. Pawel Szulc http://rabbitonweb.com
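
The suggested swap, on a toy (key, value) RDD (a sketch assuming an existing SparkContext `sc`):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey materializes every value of a key on one executor:
    val viaGroup = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }

    // reduceByKey combines map-side first, so no single group has to fit in memory:
    val viaReduce = pairs.reduceByKey(_ + _)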

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Paweł Szulc
But groupByKey will repartition according to the number of keys, as I understand how it works. How do you know that you haven't reached the groupByKey phase? Are you using a profiler, or do you base that assumption only on logs? Sat, Feb 28, 2015, 8:12 PM Arun Luthra, user arun.lut...@gmail.com

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
So, actually I am removing the persist for now, because there is significant filtering that happens after calling textFile()... but I will keep that option in mind. I just tried a few different combinations of number of executors, executor memory, and more importantly, number of tasks... all

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
The job fails before getting to groupByKey. I see a lot of timeout errors in the yarn logs, like: 15/02/28 12:47:16 WARN util.AkkaUtils: Error sending message in 1 attempts akka.pattern.AskTimeoutException: Timed out and 15/02/28 12:47:49 WARN util.AkkaUtils: Error sending message in 2

getting this error while runing

2015-02-28 Thread shahid
conf = SparkConf().setAppName("spark_calc3merged").setMaster("spark://ec2-54-145-68-13.compute-1.amazonaws.com:7077") sc = SparkContext(conf=conf, pyFiles=["/root/platinum.py", "/root/collections2.py"]) 15/02/28 19:06:38 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3.0 (TID 38,

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
The Spark UI names the line number and name of the operation (repartition in this case) that it is performing. Only if this information is wrong (just a possibility), could it have started groupByKey already. I will try to analyze the amount of skew in the data by using reduceByKey (or simply

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
+1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple of days and digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Ted Yu
Have you verified that the spark-catalyst_2.10 jar was in the classpath? Cheers On Sat, Feb 28, 2015 at 9:18 AM, Ashish Nigam ashnigamt...@gmail.com wrote: Hi, I wrote a very simple program in scala to convert an existing RDD to SchemaRDD. But createSchemaRDD function is throwing exception

Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
Just wanted to point out: raising the memory overhead (as I saw in the logs) was the fix for this issue, and I have not seen dying executors since this value was increased. On Tue, Feb 24, 2015 at 3:52 AM, Anders Arpteg arp...@spotify.com wrote: If you're thinking of the yarn memory overhead, then yes,

Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Ashish Nigam
Hi, I wrote a very simple program in Scala to convert an existing RDD to a SchemaRDD, but the createSchemaRDD function is throwing an exception: Exception in thread main scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot
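
The minimal shape of such a program, as a sketch (class and table names are illustrative); the createSchemaRDD implicit is what exercises ScalaReflection:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    object SchemaRDDExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("schema-rdd").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD (Spark 1.2)

        val people = sc.parallelize(Seq(Person("ann", 30), Person("bob", 25)))
        people.registerTempTable("people")
        sqlContext.sql("SELECT name FROM people WHERE age > 26")
          .collect().foreach(println)
        sc.stop()
      }
    }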

Re: getting this error while runing

2015-02-28 Thread shahid
Also, the data file is on HDFS.

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Ashish Nigam
Also, can the Scala version play any role here? I am using Scala 2.11.5, but all Spark packages have a dependency on Scala 2.11.2. Just wanted to make sure that the Scala version is not an issue here. On Sat, Feb 28, 2015 at 9:18 AM, Ashish Nigam ashnigamt...@gmail.com wrote: Hi, I wrote a very simple

Re: Accumulator in SparkUI for streaming

2015-02-28 Thread Tim Smith
So somehow Spark Streaming doesn't support display of named accumulators in the web UI? On Tue, Feb 24, 2015 at 7:58 AM, Petar Zecevic petar.zece...@gmail.com wrote: Interesting. Accumulators are shown on the web UI if you are using the ordinary SparkContext (Spark 1.2). It just has to be named
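
For a plain SparkContext job, giving the accumulator a name is what makes it appear on the stage page; a sketch assuming an existing `sc` (the counter and path are illustrative):

    // The second argument is the display name shown in the UI (Spark 1.2).
    val badRecords = sc.accumulator(0L, "badRecords")

    sc.textFile("hdfs:///in").foreach { line =>
      if (line.isEmpty) badRecords += 1L
    }
    println(badRecords.value)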

Re: Tools to manage workflows on Spark

2015-02-28 Thread Mayur Rustagi
Sorry, not really. Spork is a way to migrate your existing Pig scripts to Spark, or write new Pig jobs that can then execute on Spark. For orchestration you are better off using Oozie, especially if you are using other execution engines/systems besides Spark. Regards, Mayur Rustagi Ph: +1 (760) 203 3257

Re: Tools to manage workflows on Spark

2015-02-28 Thread Qiang Cao
Thanks for the pointer, Ashish! I was also looking at Spork (https://github.com/sigmoidanalytics/spork, Pig-on-Spark), but wasn't sure if that's the right direction. On Sat, Feb 28, 2015 at 6:36 PM, Ashish Nigam ashnigamt...@gmail.com wrote: You have to call spark-submit from oozie. I used this

Connection pool in workers

2015-02-28 Thread A . K . M . Ashrafuzzaman
Hi guys, I am new to Spark, and we are running a small project that collects data from Kinesis and inserts into Mongo. I would like to share a high-level view of how it is done and would love your input on it. I am fetching Kinesis data, and for each RDD: - Parsing String data - Inserting into
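
A common shape for the sink side is one connection per partition rather than per record; a sketch where StubMongoClient stands in for whatever Mongo driver is actually in use (assumes an existing DStream `dstream` of strings):

    // Hypothetical client, only to make the pattern self-contained.
    class StubMongoClient(uri: String) {
      def insert(collection: String, doc: String): Unit = println(s"$collection <- $doc")
      def close(): Unit = ()
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = new StubMongoClient("mongodb://mongo-host:27017")
        records.foreach(r => client.insert("events", r))
        client.close()
      }
    }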

Re: Tools to manage workflows on Spark

2015-02-28 Thread Ted Yu
Here is the latest modification in the Spork repo: Mon Dec 1 10:08:19 2014. Not sure if it is being actively maintained. On Sat, Feb 28, 2015 at 6:26 PM, Qiang Cao caoqiang...@gmail.com wrote: Thanks for the pointer, Ashish! I was also looking at Spork https://github.com/sigmoidanalytics/spork

Re: Tools to manage workflows on Spark

2015-02-28 Thread Qiang Cao
Thanks, Mayur! I'm looking for something that would allow me to easily describe and manage a workflow on Spark. A workflow in my context is a composition of Spark applications that may depend on one another based on HDFS inputs/outputs. Is Spork a good fit? The orchestration I want is at the app level.

Re: Reg. Difference in Performance

2015-02-28 Thread Deep Pradhan
You mean the size of the data that we take? Thank you. Regards, Deep On Sun, Mar 1, 2015 at 6:04 AM, Joseph Bradley jos...@databricks.com wrote: Hi Deep, Compute times may not be very meaningful for small examples like those. If you increase the sizes of the examples, then you may start to

Re: Tools to manage workflows on Spark

2015-02-28 Thread Mayur Rustagi
We do maintain it, but in the Apache repo itself. However, Pig cannot do orchestration for you. I am not sure what you are looking for from Pig in this context. Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoid.com http://www.sigmoidanalytics.com/ @mayur_rustagi

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Michael Armbrust
I think it's possible that the problem is that the Scala compiler is not being loaded by the primordial classloader (but instead by some child classloader), and thus the Scala reflection mirror fails to initialize when it can't find it. Unfortunately, the only solution that I know of is to load