Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Nicolas Paris
IMO your JSON cannot be read in parallel at all, so Spark only lets you play with memory again. I'd say at some step it has to fit in both one executor and the driver. I'd try something like 20GB for both the driver and the executors, and use a dynamic number of executors in order to then
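
A minimal sketch of the kind of settings suggested above, assuming a Scala SparkSession and that 20g is just the figure from the thread rather than a recommendation; note that spark.driver.memory normally has to be supplied before the driver JVM starts (e.g. via spark-submit or spark-defaults.conf) and is listed here only for completeness:

    import org.apache.spark.sql.SparkSession

    // A single non-splittable JSON file is parsed by one task, so both an
    // executor and (depending on the actions run) the driver need enough heap.
    val spark = SparkSession.builder()
      .appName("huge-json-read")
      .config("spark.driver.memory", "20g")            // normally set via spark-submit
      .config("spark.executor.memory", "20g")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true") // needed for dynamic allocation on YARN
      .getOrCreate()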

Re: Apply Core Java Transformation UDF on DataFrame

2018-06-05 Thread Chetan Khatri
Can anyone throw light on this? It would be helpful. On Tue, Jun 5, 2018 at 1:41 AM, Chetan Khatri wrote: > All, > > I would like to apply a core Java transformation UDF on a DataFrame created from > a table or flat files and return a new DataFrame object. Any suggestions, with > respect to Spark internals. > >
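
A minimal Scala sketch of one way to do this: wrap the existing Java logic in a UDF and apply it with withColumn. The table and column names are illustrative, and MyJavaTransformer.transform stands in for the actual core Java method (shown commented out, with an inline placeholder so the sketch is self-contained):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().appName("java-udf-sketch").getOrCreate()

    // Real case: delegate to the existing Java helper, e.g.
    //   val transformUdf = udf((s: String) => MyJavaTransformer.transform(s))
    val transformUdf = udf((s: String) => if (s == null) null else s.toUpperCase)

    val df = spark.table("some_table")   // or spark.read.csv(...) for flat files
    val result = df.withColumn("transformed", transformUdf(col("some_column")))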

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Yes, I would say that's the first thing I tried. The thing is, even though I provide more executors and more memory for each, this process gets an OOM in only one task, which is stuck and unfinished. I don't think it's splitting the load across other tasks. I had 11 blocks on that file I stored in HDFS
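
A quick way to confirm that the whole file really lands in a single partition (and hence a single task), as described above — a small sketch using the path from the thread:

    val hdf = spark.read.json("/user/tmp/hugedatafile")
    // If this prints 1, the file is parsed by one task no matter how many
    // executors or how much memory per executor is granted.
    println(hdf.rdd.getNumPartitions)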

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Nicolas Paris
Have you played with the driver/executor memory configuration? Increasing them should avoid the OOM. 2018-06-05 22:30 GMT+02:00 raksja : > Agreed, gzip is non-splittable; the question that I have and the examples I > have > posted above all refer to a non-compressed file. A single JSON file > with

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Agreed, gzip is non-splittable; the question that I have and the examples I have posted above all refer to a non-compressed file: a single JSON file with an array of objects on one continuous line. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
Yes, it's in the right format, as we are able to process it in Python. I also agree that JSONL would work if we split that [{},{},...] array of objects into something like this: {} {} {} But since I get the data from another system in this form, which I cannot control, my question is whether it's possible
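
For reference, Spark 2.2 and later have a multiLine option on the JSON reader intended for exactly this layout (one file containing a JSON array of objects). A sketch using the path from the thread; the whole file is still parsed by a single task, so the memory pressure discussed in this thread remains:

    // Spark 2.2+: parse a file whose content is a single JSON array of objects.
    val hdf = spark.read
      .option("multiLine", "true")
      .json("/user/tmp/hugedatafile")

    hdf.printSchema()
    hdf.show(2)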

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Holden Karau
If it's one 33MB file which decompresses to 1.5GB, then there is also a chance you need to split the inputs, since gzip is a non-splittable compression format. On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias wrote: > Are you sure that your JSON file has the right format? > >

Spark maxTaskFailures is not recognized with Cassandra

2018-06-05 Thread ravidspark
Hi all, I configured the number of task failures using spark.task.maxFailures as 10 in my Spark application, which ingests data into Cassandra reading from Kafka. I observed that when the Cassandra service is down, it is not retrying for the property I set, i.e. 10. Instead it is retrying with the
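
For context, spark.task.maxFailures is an application-wide setting that must be in place before the SparkContext is created; a minimal sketch of setting it, together with the connector-side retry knob that may be the one actually governing retries here — the spark.cassandra.query.retry.count name is quoted from memory and should be checked against the spark-cassandra-connector documentation for your version:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.task.maxFailures", "10")            // task-level retries, application-wide
      // Assumption: the Cassandra connector retries failed queries itself,
      // independently of Spark's task retry count.
      .set("spark.cassandra.query.retry.count", "10")

    val spark = SparkSession.builder().config(conf).getOrCreate()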

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Anastasios Zouzias
Are you sure that your JSON file has the right format? spark.read.json(...) expects a file where *each line is a JSON object*. My wild guess is that val hdf = spark.read.json("/user/tmp/hugedatafile") followed by hdf.show(2) or hdf.take(1) gives an OOM because it tries to fetch all the data into the driver. Can you

Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread raksja
I have a JSON file which is a continuous array of objects of similar type, [{},{}...], about 1.5GB uncompressed and 33MB gzip compressed. It is uploaded as hugedatafile to HDFS; it is not a JSONL file, it's a whole regular JSON file. [{"id":"1","entityMetadata":{"lastChange":"2018-05-11

Re: Writing custom Structured Streaming receiver

2018-06-05 Thread alz2
I'm implementing a simple Structured Streaming Source with the V2 API in Java. I've taken the Offset logic (regarding startOffset, endOffset, lastCommittedOffset, etc.) from the socket source and also your receivers. However, upon startup, for some reason Spark says that the initial offset or -1,

Using checkpoint much, much faster than cache. Why?

2018-06-05 Thread Phillip Henry
Hi, folks. I am using Spark 2.2.0 and a combination of Spark ML's LinearSVC and OneVsRest to classify some documents that are originally read from HDFS using sc.wholeTextFiles. When I use it on small documents, I get the results in a few minutes. When I use it on the same number of large
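
For readers following along, a short sketch of the two approaches being compared, with illustrative paths: cache() keeps the data but also keeps the full lineage, while checkpoint() writes to reliable storage and truncates the lineage, which can matter for long iterative ML pipelines:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-vs-checkpoint").getOrCreate()
    import spark.implicits._

    val docs = spark.sparkContext
      .wholeTextFiles("hdfs:///data/docs")
      .toDF("path", "text")

    // Option 1: cache — data is kept in memory/disk, lineage is preserved.
    val cached = docs.cache()

    // Option 2: checkpoint — data is written to the checkpoint dir and the
    // lineage is cut (eager by default for Dataset.checkpoint in Spark 2.x).
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
    val checkpointed = docs.checkpoint()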

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
"dependent" I mean this batch's job relies on the previous batch's result. So this batch should wait for the finish of previous batch, if you set " spark.streaming.concurrentJobs" larger than 1, then the current batch could start without waiting for the previous batch (if it is delayed), which

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
On 05/06/2018 13:44, Saisai Shao wrote: You need to read the code; this is an undocumented configuration. I'm on it right now, but Spark is a big piece of software. Basically this will break the ordering of streaming jobs; AFAIK you may get unexpected results if your streaming jobs are not

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
You need to read the code; this is an undocumented configuration. Basically this will break the ordering of streaming jobs; AFAIK you may get unexpected results if your streaming jobs are not independent. thomas lavocat wrote on Tue, Jun 5, 2018 at 19:17: > Hello, > > Thanks for your answer. > > On

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
Hello, thanks for your answer. On 05/06/2018 11:24, Saisai Shao wrote: spark.streaming.concurrentJobs is a driver-side internal configuration; it controls how many streaming jobs can be submitted concurrently in one batch. Usually this should not be configured by the user, unless you're

Strange codegen error for SortMergeJoin in Spark 2.2.1

2018-06-05 Thread Rico Bergmann
Hi! I get a strange error when executing a complex SQL query involving 4 tables that are left-outer-joined: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 18: failed to compile: org.codehaus.commons.compiler.CompileException: File
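
One common way to check whether the failure really comes from whole-stage code generation (and to work around it while investigating) is to switch codegen off for the session; a sketch, not a fix for the underlying issue:

    // Fall back to the interpreted execution path; this usually sidesteps
    // Janino CompileException failures at the cost of some performance.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    // ...then re-run the four-table left-outer-join query.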

Reg:- Py4JError in Windows 10 with Spark

2018-06-05 Thread @Nandan@
Hi, I am getting this error: --- Py4JError Traceback (most recent call last) in () 3 TOTAL = 100 4 dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache() > 5 print("Number of random

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
spark.streaming.concurrentJobs is a driver-side internal configuration; it controls how many streaming jobs can be submitted concurrently in one batch. Usually this should not be configured by the user, unless you're familiar with Spark Streaming internals and know the implications of this
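
For completeness, a sketch of where the property goes: it is an application-level (driver-side) setting, so it belongs in the SparkConf before the StreamingContext is created, not on individual nodes:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-jobs-sketch")
      // Undocumented/internal: allow the jobs of up to 2 batches to run at once.
      // Only sensible if consecutive batches are truly independent.
      .set("spark.streaming.concurrentJobs", "2")

    val ssc = new StreamingContext(conf, Seconds(10))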

is there a way to parse and modify raw spark sql query?

2018-06-05 Thread kant kodali
Hi all, is there a way to parse and modify a raw Spark SQL query? For example, given the following query, spark.sql("select hello from view"), I want to modify the query or logical plan such that I get a result equivalent to the query below: spark.sql("select foo, hello from view"). Any
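
Two hedged sketches: the session's SQL parser can turn the raw query into a LogicalPlan for inspection or rewriting (sessionState is an internal, unstable API, so treat this as unsupported territory), and for this particular example the same result can be obtained on the DataFrame side without touching the parser:

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // 1) Parse the raw SQL into an (unresolved) logical plan.
    val plan: LogicalPlan =
      spark.sessionState.sqlParser.parsePlan("select hello from view")
    println(plan.treeString)   // inspect or transform() the plan from here

    // 2) Equivalent result for this example, using the public API:
    val df = spark.table("view").selectExpr("foo", "hello")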

Re: Help explaining explain() after DataFrame join reordering

2018-06-05 Thread Matteo Cossu
Hello, as explained here, the join order can be changed by the optimizer. The difference introduced in Spark 2.2 is that the reordering is based on statistics instead of heuristics, which can appear "random"
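
A short sketch of the knobs involved: cost-based join reordering in Spark 2.2 only applies when CBO is enabled and table statistics have been collected; my_table and join_key are placeholders:

    // Cost-based optimizer settings (Spark 2.2+).
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

    // Statistics must exist for the optimizer to reorder joins by cost.
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS join_key")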

[Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
Hi everyone, I'm wondering if the property spark.streaming.concurrentJobs should reflect the total number of possible concurrent tasks on the cluster, or a local number of concurrent tasks on one compute node. Thanks for your help. Thomas

Re: spark partitionBy with partitioned column in json output

2018-06-05 Thread Elior Malul
I had the same issue myself. I was surprised at first as well, but I found it useful: the amount of data saved for each partition decreases. When I load the data from each partition, I add the partitioned columns with the lit function before I merge the frames from the different partitions. On
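
A small sketch of the pattern described above, with illustrative column names and paths: the partition column is not stored inside the JSON files (it lives in the directory name), so it is added back with lit when a single partition directory is read directly:

    import org.apache.spark.sql.functions.lit

    // Given a DataFrame `df` that contains a "date" column: the column becomes
    // a directory (date=2018-06-05/...) and is no longer written inside each record.
    df.write.partitionBy("date").json("/data/out")

    // Reading one partition directory directly: add the column back by hand.
    val part = spark.read.json("/data/out/date=2018-06-05")
      .withColumn("date", lit("2018-06-05"))

    // Reading through the root (or with basePath) recovers the partition
    // column from the directory structure automatically.
    val all = spark.read.option("basePath", "/data/out").json("/data/out/date=2018-06-05")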