Re: Reading multiple JSON files from nested folders for data frame

Ashutosh Kumar Thu, 21 Jul 2016 22:58:52 -0700

Thanks for the response. I am using Google Cloud, and I have a couple of options:
1. Use Spark and run SQL queries through sqlContext.
2. Use Hive.
As I understand it, Hive would use Spark as its underlying engine. Is that
correct?
Also, my data is JSON and is highly nested.
What do you suggest?


Thanks
Ashutosh
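
For reference, option 1 might look roughly like the sketch below. It is only
an illustration under stated assumptions: the helper names (dated_json_glob,
load_dated_json), the table name and the one-folder-per-date layout are
hypothetical and not from this thread. It relies on the Spark 1.x calls
sqlContext.read.json, DataFrame.registerTempTable and sqlContext.sql, and
expects an already created sqlContext to be passed in.

```python
def dated_json_glob(root, levels=1):
    """Build a glob such as 'root/*/*.json' for JSON files kept
    `levels` folders below `root` (e.g. one folder per date)."""
    parts = [root.rstrip("/")] + ["*"] * levels + ["*.json"]
    return "/".join(parts)


def load_dated_json(sql_context, root, table_name, levels=1):
    """Read every matching JSON file into one DataFrame and register it
    under `table_name` so it can be queried with SQL.

    `sql_context` is assumed to be an existing Spark 1.x SQLContext
    (or HiveContext); on Spark 2.x, createOrReplaceTempView replaces
    registerTempTable.
    """
    df = sql_context.read.json(dated_json_glob(root, levels))
    df.registerTempTable(table_name)
    return df


# Hypothetical usage (path and field names are made up for illustration):
# df = load_dated_json(sqlContext, "gs://some-bucket/events", "events")
# df.printSchema()  # shows the inferred, possibly nested, schema
# sqlContext.sql("SELECT user.address.city FROM events").show()
```

Nested JSON objects are inferred as struct columns, so their fields can be
addressed with dot notation in the query, as in the commented usage above.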

On Fri, Jul 22, 2016 at 7:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> If you are using EMR, please try their latest release. If you are using
> SQL, there will be very few reasons left to use Spark at all (particularly
> given that hiveContext relies heavily on Hive).
>
> Even over regular CSV data I have seen Hive on Tez deliver performance
> gains of 100x (querying 64 million rows x 570 columns in 2.5 minutes). With
> ORC it is much faster (querying 64 million rows x 570 columns in 54
> seconds), and with proper partitioning and indexing in ORC it is faster
> still (querying 64 million rows x 570 columns in 19 seconds). There is
> perhaps a reason why Spark makes things slow while using ORC :)
>
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar <kmr.ashutos...@gmail.com>
> wrote:
>
>> It works. Would it be better to use Hive in this case for performance?
>>
>> On Thu, Jul 21, 2016 at 12:30 PM, Simone <simone.mirag...@gmail.com>
>> wrote:
>>
>>> If you have a folder with a bunch of JSON files inside it, then yes, it
>>> should work. Just set the path to something like
>>> "path/to/your/folder/*.json".
>>> All files will be loaded into a single DataFrame, and if the files have
>>> different schemas, the resulting schema will be the union of them.
>>> It should work; let me know.
>>>
>>> Simone Miraglia
>>> ------------------------------
>>> From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>> Sent: 21/07/2016 08:55
>>> To: Simone <simone.mirag...@gmail.com>; user @spark
>>> <user@spark.apache.org>
>>> Subject: Re: Reading multiple JSON files from nested folders for data
>>> frame
>>>
>>> That example points to a particular JSON file. Will it work the same way
>>> if I point to the top-level folder containing all the JSON files?
>>>
>>> On Thu, Jul 21, 2016 at 12:04 PM, Simone <simone.mirag...@gmail.com>
>>> wrote:
>>>
>>>> Yes, you can. Have a look here:
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>>
>>>> Hope it helps
>>>>
>>>> Simone Miraglia
>>>> ------------------------------
>>>> From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>>> Sent: 21/07/2016 08:19
>>>> To: user @spark <user@spark.apache.org>
>>>> Subject: Reading multiple JSON files from nested folders for data frame
>>>>
>>>> I need to read a bunch of JSON files kept in date-wise folders and
>>>> run SQL queries on them using a DataFrame. Is it possible to do so?
>>>> Please provide some pointers.
>>>>
>>>> Thanks
>>>> Ashutosh
>>>>
>>>
>>>
>>
>
