Re: GC overhead exceeded

2017-08-17 Thread Pralabh Kumar
What is your executor memory? Please share the code as well. On Fri, Aug 18, 2017 at 10:06 AM, KhajaAsmath Mohammed wrote: > Hi, I am getting the below error when running Spark SQL jobs. The error is thrown after about 80% of the tasks complete. Any solution?

GC overhead exceeded

2017-08-17 Thread KhajaAsmath Mohammed
Hi, I am getting the below error when running Spark SQL jobs. The error is thrown after about 80% of the tasks complete. Any solution? The relevant settings are:

spark.storage.memoryFraction=0.4
spark.sql.shuffle.partitions=2000
spark.default.parallelism=100
#spark.eventLog.enabled=false
#spark.scheduler.revive.interval=1s
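GC-overhead errors in Spark SQL jobs usually point at undersized executors rather than these settings; note also that spark.storage.memoryFraction is a legacy (pre-1.6) knob that the unified memory manager ignores unless spark.memory.useLegacyMode is enabled. A minimal sketch of the kind of sizing to try, with placeholder values rather than recommendations:

```python
from pyspark.sql import SparkSession

# Placeholder sizes, not recommendations; tune against the observed GC time.
spark = (SparkSession.builder
         .appName("spark-sql-job")
         .config("spark.executor.memory", "8g")                 # bigger heap per executor
         .config("spark.yarn.executor.memoryOverhead", "1024")  # off-heap headroom on YARN (MB)
         .config("spark.sql.shuffle.partitions", "2000")
         .getOrCreate())
```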

Spark Yarn mode - unsupported exception

2017-08-17 Thread Darshan Pandya
Hello Users, I am running into a Spark issue: "Unsupported major.minor version 52.0". The code I am trying to run is https://github.com/cpitman/spark-drools-example/. This code runs fine in Spark local mode but fails horribly with the above exception when the job is submitted in YARN mode.
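Class file version 52.0 corresponds to Java 8, so the usual cause is that the YARN containers are running an older JVM (Java 7 only accepts class files up to version 51.0) than the one the jar was built with. A minimal sketch of one possible fix, assuming Java 8 is installed on every node at a hypothetical path; alternatively, rebuilding the project targeting the cluster's Java version avoids the mismatch:

```python
from pyspark.sql import SparkSession

# Hypothetical install path; this only helps if Java 8 actually exists
# there on every node in the cluster.
JAVA8_HOME = "/usr/java/jdk1.8.0"

spark = (SparkSession.builder
         .appName("spark-drools-example")
         # Point the YARN application master and the executors at Java 8.
         .config("spark.yarn.appMasterEnv.JAVA_HOME", JAVA8_HOME)
         .config("spark.executorEnv.JAVA_HOME", JAVA8_HOME)
         .getOrCreate())
```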

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
Hi Palwell, Tried doing that, but the dates are all becoming null after the transformation with functions. df2 = dflead.select('Enter_Date', f.to_date(dflead.Enter_Date)) Any insight? Thanks, Aakash. On Fri, Aug 18, 2017 at 12:23 AM, Patrick Alwell …
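In Spark 2.1, to_date() takes no format argument and assumes the default yyyy-MM-dd layout, so strings in any other format parse to null. A hedged workaround, assuming the column holds strings like MM/dd/yyyy (substitute the sheet's real pattern):

```python
from pyspark.sql import functions as f

# Parse with an explicit pattern, then cast down to a date.
# 'MM/dd/yyyy' is an assumed pattern; use what the sheet actually contains.
df2 = dflead.withColumn(
    "Enter_Date_parsed",
    f.unix_timestamp(f.col("Enter_Date"), "MM/dd/yyyy")
     .cast("timestamp")
     .cast("date"))
```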

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
Hey all, Thanks! I had a discussion with the person who authored that package and informed them about this bug, but in the meantime found a small tweak that gets the job done. For now I'm getting the date as a string by predefining the schema, but I want to later …
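The tweak described presumably looks something like the sketch below: declare the date column as StringType up front so the Excel reader cannot mis-convert it, and parse it afterwards. The package name (assumed here to be the crealytics spark-excel reader), its option names, the column names, and the file path are all assumptions:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Keep Enter_Date as a plain string at read time; parse it later.
schema = StructType([
    StructField("Enter_Date", StringType(), True),
    # ... remaining columns ...
])

# 'com.crealytics.spark.excel' is assumed to be the external JAR under
# discussion; its option names vary between versions.
dflead = (spark.read
          .format("com.crealytics.spark.excel")
          .option("useHeader", "true")
          .schema(schema)
          .load("leads.xlsm"))
```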

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Hyukjin Kwon
When multiLine is not set, we currently only support ASCII-compatible encodings, to my knowledge, mainly because of the line separator, as I investigated in the comment. When multiLine is set, it appears the encoding option is not considered; I actually meant that encoding does not work at all in this case …
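A small sketch of the behaviour being described, using a hypothetical non-UTF-8 file; the second read is the case where the encoding option is reportedly ignored:

```python
# Without multiLine, an ASCII-compatible encoding such as SJIS is honoured.
df_ok = (spark.read
         .option("encoding", "SJIS")
         .csv("/data/input.csv"))

# With multiLine set, the thread reports that the encoding option is not
# applied, so non-UTF-8 input comes back garbled.
df_broken = (spark.read
             .option("multiLine", True)
             .option("encoding", "SJIS")
             .csv("/data/input.csv"))
```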

Spark 2 | Java | Dataset

2017-08-17 Thread Jean Georges Perrin
Hey, I was wondering if it would make sense to have a Dataset of something other than Row? Does anyone have an example (in Java) or a use case? My use case would be to use Spark on existing objects we have and benefit from the distributed processing of those objects. jg

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Han-Cheol Cho
Hi, Thank you for your response. I finally found the cause of this. When the multiLine option is set, the input file is read by the UnivocityParser.parseStream() method. This method, in turn, calls convertStream(), which initializes the tokenizer with tokenizer.beginParsing(inputStream) and parses records using …

Working with hadoop har file in spark

2017-08-17 Thread Nicolas Paris
Hi, I put a million files into a har archive on HDFS. I'd like to iterate over their file paths and read them. (Basically they are PDFs, and I want to transform them into text with Apache PDFBox.) My first attempt has been to list them with the hadoop command `hdfs dfs -ls har:///user//har/pdf.har` …
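Since HarFileSystem is an ordinary Hadoop FileSystem, Spark's binaryFiles() should be able to enumerate the archive members directly rather than shelling out to hdfs dfs. A hedged sketch; PDFBox is a Java library, so this Python version swaps in pdfminer.six as an assumed substitute, which would need to be installed on every executor:

```python
import io
from pdfminer.high_level import extract_text  # assumed available on executors

# binaryFiles goes through the Hadoop FileSystem API, so a har:// path
# should yield a (path, bytes) pair for each archived file.
pdfs = spark.sparkContext.binaryFiles("har:///user//har/pdf.har")

def to_text(content):
    # Map unparseable files to None so one bad PDF does not fail the job.
    try:
        return extract_text(io.BytesIO(content))
    except Exception:
        return None

texts = pdfs.mapValues(to_text)
```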