Re: Reading multiple JSON files from nested folders for data frame

Gourav Sengupta Thu, 21 Jul 2016 19:05:40 -0700

If you are using EMR, please try their latest release. If you are using SQL,
there will be very few reasons left to use Spark at all (particularly given
that HiveContext relies heavily on Hive).


Just over regular CSV data I have seen Hive on Tez give performance gains of
100x (querying 64 million rows x 570 columns in 2.5 minutes). With ORC the
same query is faster still (64 million rows x 570 columns in 54 seconds), and
with proper partitioning and indexing in ORC it is faster again (64 million
rows x 570 columns in 19 seconds). There is perhaps a reason why Spark makes
things slow while using ORC :)


Regards,
Gourav

On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar <kmr.ashutos...@gmail.com>
wrote:

> It works. Is it better to use Hive in this case for better performance?
>
> On Thu, Jul 21, 2016 at 12:30 PM, Simone <simone.mirag...@gmail.com>
> wrote:
>
>> If you have a folder with a bunch of JSON files inside it, then yes, it
>> should work. Just set the path to something like "path/to/your/folder/*.json".
>> All files will be loaded into one DataFrame, and the schema will be the union
>> of the schemas of your JSON files (this only matters if the files have
>> different schemas).
>> It should work. Let me know.
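
A minimal sketch of the glob approach described above, in Python (PySpark). It assumes `sql_entry_point` is a SQLContext, HiveContext, or SparkSession, all of which expose `.read.json(path)`. The helper names `json_glob` and `load_json` and the `nested_levels` parameter are hypothetical, introduced only for illustration. It also assumes the date-wise folders sit a fixed number of levels below the top-level folder, so that one `*` per level matches them.

```python
def json_glob(root, nested_levels=0):
    """Build a glob matching *.json files `nested_levels` folders below `root`.

    Hypothetical helper: one "*" path component is added per folder level.
    """
    parts = [root.rstrip("/")] + ["*"] * nested_levels + ["*.json"]
    return "/".join(parts)


def load_json(sql_entry_point, root, nested_levels=0):
    """Read every matching JSON file into a single DataFrame.

    `sql_entry_point` is assumed to be a SQLContext, HiveContext, or
    SparkSession; the glob is passed straight to its DataFrameReader.
    """
    return sql_entry_point.read.json(json_glob(root, nested_levels))
```

For example, `load_json(sqlContext, "path/to/your/folder", nested_levels=1)` would read `path/to/your/folder/*/*.json`, which covers one level of date-wise folders.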
>>
>> Simone Miraglia
>> ------------------------------
>> From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
>> Sent: 21/07/2016 08:55
>> To: Simone <simone.mirag...@gmail.com>; user @spark
>> <user@spark.apache.org>
>> Subject: Re: Reading multiple JSON files from nested folders for data
>> frame
>>
>> That example points to a particular JSON file. Will it work the same way if
>> I point to the top-level folder containing all the JSON files?
>>
>> On Thu, Jul 21, 2016 at 12:04 PM, Simone <simone.mirag...@gmail.com>
>> wrote:
>>
>>> Yes you can - have a look here
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>
>>> Hope it helps
>>>
>>> Simone Miraglia
>>> ------------------------------
>>> From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>> Sent: 21/07/2016 08:19
>>> To: user @spark <user@spark.apache.org>
>>> Subject: Reading multiple JSON files from nested folders for data frame
>>>
>>> I need to read a bunch of JSON files kept in date-wise folders and run
>>> SQL queries on them using a DataFrame. Is it possible to do so? Please
>>> provide some pointers.
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>
