Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread ayan guha
Yes, it is possible. You need to use the jsonFile method on the SQLContext and then create a DataFrame from the RDD. Then register it as a table. It should be 3 lines of code, thanks to Spark. You may see a few YouTube videos, especially on unifying pipelines. On 3 May 2015 19:02, Jai jai4l...@gmail.com wrote: Hi,
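
For illustration, a minimal sketch of what this suggests, assuming the Spark 1.3 SQLContext API; the input path and table name below are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("json-sql"))
    val sqlContext = new SQLContext(sc)

    // jsonFile reads newline-delimited JSON into a DataFrame (Spark 1.3+).
    val df = sqlContext.jsonFile("hdfs:///data/events.json")   // hypothetical path

    // Register it so it can be queried with SQL.
    df.registerTempTable("events")                              // hypothetical table name
    sqlContext.sql("SELECT count(*) FROM events").show()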

Hardware requirements

2015-05-03 Thread sherine ahmed
I need to use Spark to load 500 GB of data from Hadoop on a standalone-mode cluster. What are the minimum hardware requirements, given that it will be used for advanced analysis (social network analysis)?

Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread Jai
Hi, I am a noob to Spark and related technology. I have JSON stored at the same location on all worker clients (Spark cluster). I am looking to load this JSON data set on these clients and run SQL queries against it, like distributed SQL. Is it possible to achieve this? Right now, the master submits the task to one node only.

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
I don't know the full context of what you're doing, but serialization errors usually mean you're attempting to serialize something that can't be serialized, like the SparkContext. Kryo won't help there. The arguments to spark-submit you posted previously look good: 2) --num-executors 96
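
As an illustration of the usual cause (class and field names below are hypothetical): referring to a field from inside a closure captures the enclosing object, and if that object holds the SparkContext, the task cannot be serialized, no matter which serializer is configured.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class Pipeline(sc: SparkContext) {
      val factor = 10

      // Problematic: using the field `factor` captures `this`, and with it
      // the non-serializable SparkContext -> "Task not serializable".
      def badScale(nums: RDD[Int]): RDD[Int] =
        nums.map(n => n * factor)

      // Safer: copy the needed value into a local val so only that value
      // is captured by the closure.
      def goodScale(nums: RDD[Int]): RDD[Int] = {
        val localFactor = factor
        nums.map(n => n * localFactor)
      }
    }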

Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-03 Thread Nick Travers
I'm currently trying to join two large tables (on the order of 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt. I'm reading in both tables using a HiveContext, with the underlying files stored as Parquet. I'm using something along the lines of
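
Purely as an illustration of the setup described, not the poster's actual code (table and column names are made up, and an existing SparkContext sc is assumed):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Both tables are Hive tables backed by Parquet files (hypothetical names).
    val left  = hiveContext.table("orders")
    val right = hiveContext.table("line_items")

    // Equi-join on a shared key column; the shuffle for a 1B x 1B join is
    // typically where the long GC pauses show up.
    val joined = left.join(right, left("order_id") === right("order_id"))
    joined.count()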

Re: Drop a column from the DataFrame.

2015-05-03 Thread Olivier Girardot
Great, thx. On Sat, May 2, 2015 at 23:58, Ted Yu yuzhih...@gmail.com wrote: This is coming in 1.4.0 https://issues.apache.org/jira/browse/SPARK-7280 On May 2, 2015, at 2:27 PM, Olivier Girardot ssab...@gmail.com wrote: Sounds like a patch for a drop method... On Sat, May 2, 2015 at 21:03,
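
A quick sketch of the difference (df and the column name are hypothetical): before 1.4.0 you select around the unwanted column; from 1.4.0 (SPARK-7280) DataFrame gains a drop method.

    // Workaround before Spark 1.4.0: select every column except the unwanted one.
    val trimmed = df.select(df.columns.filter(_ != "unwanted_col").map(df(_)): _*)

    // From Spark 1.4.0 onwards (SPARK-7280):
    val trimmed14 = df.drop("unwanted_col")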

Re: Questions about Accumulators

2015-05-03 Thread Eugen Cepoi
Yes, that's it. If a partition is lost, some steps will need to be re-executed to recompute it, possibly including the map function in which you update the accumulator. I think you can do it more safely in a transformation near the action, where it is less likely that an error will occur (not always
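
To make the distinction concrete, a small sketch (the data and variable names are made up; sc is an existing SparkContext): updating an accumulator in a map() can be replayed if the stage is recomputed, while doing it in foreach(), an action, gets the once-only guarantee.

    val errorCount = sc.accumulator(0)
    val lines = sc.parallelize(Seq("a", "", "b"))

    // Risky: if this stage is recomputed after a failure, the updates from
    // the map() may be applied more than once.
    val passedThrough = lines.map { line =>
      if (line.isEmpty) errorCount += 1
      line
    }

    // Guaranteed once per successful task: accumulator updates inside an action.
    lines.foreach { line =>
      if (line.isEmpty) errorCount += 1
    }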

Re: Questions about Accumulators

2015-05-03 Thread Dean Wampler
Yes, correct. However, note that when an accumulator operation is *idempotent*, meaning that repeated application for the same data behaves exactly like one application, then that accumulator can be safely called in transformation steps (non-actions), too. For example, max and min tracking. Just
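
A hedged sketch of the kind of idempotent accumulator this refers to, using a custom AccumulatorParam that tracks the maximum value seen (names and data are made up): re-applying the same values under max() cannot change the result, so updating it inside a transformation is safe.

    import org.apache.spark.AccumulatorParam

    // Accumulator that keeps the maximum value it has seen.
    object MaxParam extends AccumulatorParam[Int] {
      def zero(initial: Int): Int = Int.MinValue
      def addInPlace(a: Int, b: Int): Int = math.max(a, b)
    }

    val maxSeen = sc.accumulator(Int.MinValue)(MaxParam)
    val numbers = sc.parallelize(1 to 100)

    // Safe even inside a transformation: replaying the same values under
    // max() leaves the accumulator unchanged.
    val doubled = numbers.map { n =>
      maxSeen += n
      n * 2
    }
    doubled.count()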

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
How big is the data you're returning to the driver with collectAsMap? You are probably running out of memory trying to copy too much data back to it. If you're trying to force a map-side join, Spark can do that for you in some cases within the regular DataFrame/RDD context. See
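
For reference, a minimal sketch of a hand-rolled map-side (broadcast) join, assuming the small side fits comfortably in driver and executor memory; the data and names below are stand-ins:

    // Small lookup data set collected to the driver as a map.
    val smallRdd = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    val lookupBc = sc.broadcast(smallRdd.collectAsMap())

    // The large side stays distributed; each task reads the broadcast copy
    // of the small table, so the big RDD is never shuffled.
    val largeRdd = sc.parallelize(Seq(("k1", 10), ("k3", 30)))
    val joined = largeRdd.map { case (key, value) =>
      (key, value, lookupBc.value.get(key))
    }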

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
IMHO, you are trying waaay too hard to optimize work on what is really a small data set. 25G, even 250G, is not that much data, especially if you've spent a month trying to get something to work that should be simple. All these errors are from optimization attempts. Kryo is great, but if it's not

Re: Selecting download for 'hadoop 2.4 and later

2015-05-03 Thread Sean Owen
See https://issues.apache.org/jira/browse/SPARK-5492 but I think you'll need to share the stack trace as I'm not sure how this can happen since the NoSuchMethodError (not NoSuchMethodException) indicates a call in the bytecode failed to link but there is only a call by reflection. On Fri, May 1,

Questions about Accumulators

2015-05-03 Thread xiazhuchang
The official documentation says: "In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed." I don't quite understand what this means. Does it mean that if I use the accumulator in transformations (i.e. a map() operation), this

Re: Questions about Accumulators

2015-05-03 Thread Ignacio Blasco
Given the lazy nature of an RDD, if you use an accumulator inside a map() and then you call both count and saveAsTextFile on that RDD, the accumulator will be updated twice. IMHO, accumulators are a bit nondeterministic; you need to be sure when to read them to avoid unexpected re-executions. On 3/5/2015 2:09 p.
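
A small sketch of the double update (rdd and the output path are made up): each action re-evaluates the uncached lineage, so the map() runs twice and the accumulator ends up with roughly twice the expected value.

    val counter = sc.accumulator(0)
    val rdd = sc.parallelize(1 to 10)

    val mapped = rdd.map { x =>
      counter += 1
      x
    }

    // Two actions on an uncached RDD -> the map() lineage runs twice.
    mapped.count()
    mapped.saveAsTextFile("/tmp/out")   // hypothetical output path

    // Caching mapped, or reading the accumulator right after a single
    // action, avoids the duplicated updates.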

Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread Ted Yu
Looking at SQLContext.scala (in master branch), jsonFile() returns DataFrame directly: def jsonFile(path: String, samplingRatio: Double): DataFrame = FYI On Sun, May 3, 2015 at 2:14 AM, ayan guha guha.a...@gmail.com wrote: Yes it is possible. You need to use jsonfile method on SQL context
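
So the explicit RDD-to-DataFrame step isn't needed; a minimal sketch assuming an existing sqlContext (the path is made up):

    // samplingRatio controls how much of the input is scanned to infer the schema.
    val df = sqlContext.jsonFile("hdfs:///data/events.json", 1.0)
    df.printSchema()
    df.registerTempTable("events")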

PriviledgedActionException- Executor error

2015-05-03 Thread podioss
Hi, I am running several jobs in standalone mode and I notice this error in the log files on some of my nodes at the start of my jobs: INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] INFO spark.SecurityManager: Changing view acls to: root INFO

Re: Questions about Accumulators

2015-05-03 Thread xiazhuchang
“For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks

Re: Spark SQL ThriftServer Impersonation Support

2015-05-03 Thread Night Wolf
Thanks Andrew. What version of HS2 is the SparkSQL thrift server using? What would be involved in updating? Is it a simple case of bumping the dependency version in one of the project POMs? Cheers, ~N On Sat, May 2, 2015 at 11:38 AM, Andrew Lee alee...@hotmail.com wrote: Hi N, See:

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread ๏̯͡๏
Hello Dean and others, Thanks for your suggestions. I have two data sets, and all I want to do is a simple equi-join. I have a 10G limit, and as my dataset_1 exceeded that, it was throwing an OOM error. Hence I switched back to using the .join() API instead of a map-side broadcast join. I am repartitioning the data
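
For reference, a sketch of the plain shuffle equi-join described (the data, key types, and partition count are assumptions):

    import org.apache.spark.HashPartitioner

    // Two pair RDDs keyed by the join key (tiny stand-ins for the real data sets).
    val left  = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val right = sc.parallelize(Seq((1L, "x"), (2L, "y")))

    // Repartition both sides with the same partitioner, then equi-join with .join().
    val partitioner = new HashPartitioner(400)   // partition count is a guess
    val joined = left.partitionBy(partitioner)
      .join(right.partitionBy(partitioner))
    joined.count()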

Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread Dean Wampler
Note that each JSON object has to be on a single line in the files. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On
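
That is, the input should be line-delimited JSON; a made-up example of the expected layout, one complete object per line:

    {"id": 1, "name": "alice", "score": 10}
    {"id": 2, "name": "bob", "score": 7}

A single pretty-printed object spanning several lines will not be parsed as one record by jsonFile.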

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread ๏̯͡๏
Hello Dean, If I don't use the Kryo serializer I get a serialization error, and hence am using it. If I don't use partitionBy/repartition then the simple join never completed even after 7 hours, and in fact as a next step I need to run it against 250G as that is my full dataset size. Someone here suggested

Re: PySpark: slicing issue with dataframes

2015-05-03 Thread Ali Bajwa
Friendly reminder on this one. Just wanted to get confirmation that this is not by design before I log a JIRA. Thanks! Ali On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa ali.ba...@gmail.com wrote: Hi experts, Trying to use the slicing functionality in strings as part of a Spark program

How to skip corrupted avro files

2015-05-03 Thread Shing Hing Man
Hi, I am using Spark 1.3.1 to read a directory of about 2000 Avro files. The Avro files are from a third party and a few of them are corrupted. val path = {my directory of Avro files} val sparkConf = new SparkConf().setAppName("avroDemo").setMaster("local") val sc = new SparkContext(sparkConf)