Re: How to increase the Json parsing speed

2015-08-28 Thread Sabarish Sasidharan
How many executors are you using when using Spark SQL? On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: I see that you are not reusing the same mapper instance in the Scala snippet. Regards Sab On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue
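Sab's point about reusing the mapper matters because constructing a Jackson `ObjectMapper` is expensive relative to parsing a single record. A minimal sketch of the difference, in plain Java (the original Scala snippet isn't shown in the thread, and the field names here are made up for illustration):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.List;

public class MapperReuse {
    public static void main(String[] args) throws Exception {
        List<String> records = List.of(
                "{\"id\": 1, \"name\": \"a\"}",
                "{\"id\": 2, \"name\": \"b\"}");

        // Slow pattern: a new ObjectMapper constructed per record.
        for (String json : records) {
            new ObjectMapper().readTree(json); // wasteful allocation + setup
        }

        // Fast pattern: one mapper reused for every record. ObjectMapper is
        // thread-safe after configuration, so in Spark you would typically
        // create one per partition (e.g. inside mapPartitions) or hold it
        // in a lazily initialized field rather than per record.
        ObjectMapper mapper = new ObjectMapper();
        long total = 0;
        for (String json : records) {
            total += mapper.readTree(json).get("id").asLong();
        }
        System.out.println(total); // 3
    }
}
```

In a real job the per-partition variant amortizes the mapper's construction cost over thousands of records instead of paying it once per record.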

Re: How to increase the Json parsing speed

2015-08-28 Thread Ewan Higgs
Hi Gavin, You can increase the speed by choosing a better encoding. A little bit of ETL goes a long way. e.g. As you're working with Spark SQL you probably have a tabular format. So you could use CSV so you don't need to parse the field names on each entry (and it will also reduce the file
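To make Ewan's encoding point concrete: JSON repeats every field name (plus quotes, colons, and braces) in every record, while CSV carries the column names once in the header. A rough stdlib-only sketch with toy data (no CSV quoting/escaping handled; the schema is invented for illustration):

```java
public class EncodingSize {
    public static void main(String[] args) {
        int rows = 14_000; // roughly one of Gavin's files
        StringBuilder json = new StringBuilder();
        StringBuilder csv = new StringBuilder("id,name,score\n"); // header once
        for (int i = 0; i < rows; i++) {
            json.append("{\"id\":").append(i)
                .append(",\"name\":\"u").append(i)
                .append("\",\"score\":").append(i % 100).append("}\n");
            csv.append(i).append(",u").append(i).append(',')
               .append(i % 100).append('\n');
        }
        System.out.println("json bytes: " + json.length());
        System.out.println("csv bytes:  " + csv.length());
        // The field names and JSON punctuation add a fixed per-record
        // overhead that CSV does not carry, so the gap grows with row count.
    }
}
```

Less text per record means less I/O and less tokenizing work per record, which is where the parsing time goes.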

Re: How to increase the Json parsing speed

2015-08-28 Thread Gavin Yue
500, each with 8GB memory. I did the test again on the cluster. I have 6000 files, which generate 6000 tasks. Each task takes 1.5 min to finish based on the stats, so theoretically it should take 15 mins roughly. With some additional overhead, it takes 18 mins in total. Based on the local file
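A back-of-envelope check of these numbers, under the simplifying assumption that each of the 500 executors runs one task at a time (cores per executor aren't stated in the thread, so this is a lower bound on parallelism):

```java
public class WallTimeEstimate {
    public static void main(String[] args) {
        int tasks = 6000;
        int executors = 500;
        double minutesPerTask = 1.5;
        // Ceiling division: number of scheduling "waves" of tasks.
        int waves = (tasks + executors - 1) / executors;  // 12
        double idealMinutes = waves * minutesPerTask;     // 18.0
        System.out.println(waves + " waves, ~" + idealMinutes + " min");
    }
}
```

With one task slot per executor this already comes out to 18 minutes before any overhead; with multiple cores per executor the ideal time drops and the remainder is scheduling and I/O overhead.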

RE: How to increase the Json parsing speed

2015-08-28 Thread Ewan Leith
From: Gavin Yue [mailto:yue.yuany...@gmail.com] Sent: 28 August 2015 08:06 To: Sabarish Sasidharan sabarish.sasidha...@manthan.com Cc: user@spark.apache.org Subject: Re: How to increase the Json parsing speed

Re: How to increase the Json parsing speed

2015-08-27 Thread Gavin Yue
Just did some tests. I have 6000 files; each has 14K records and a 900MB file size. In Spark SQL, it would take one task roughly 1 min to parse. On the local machine, I used the same Jackson lib that ships inside Spark and just parsed it directly. FileInputStream fstream = new FileInputStream(testfile);
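The code snippet is cut off in the archive; a self-contained reconstruction of that kind of local test, reading a file line by line and parsing each line with a single shared Jackson `ObjectMapper` (the file name, one-record-per-line layout, and timing details are assumptions, not Gavin's exact code):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LocalParseTest {
    public static void main(String[] args) throws Exception {
        String testfile = args.length > 0 ? args[0] : "testfile.json";
        ObjectMapper mapper = new ObjectMapper(); // reused for every record
        long count = 0;
        long start = System.nanoTime();
        try (FileInputStream fstream = new FileInputStream(testfile);
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(fstream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                mapper.readTree(line); // one JSON record per line
                count++;
            }
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(count + " records parsed in " + ms + " ms");
    }
}
```

Comparing this single-threaded local number against the per-task time in Spark SQL is a reasonable way to isolate how much of the 1 min per task is raw Jackson parsing versus Spark's scheduling, schema inference, and row conversion.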