Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
Thanks for the response. I am using Google Cloud. I have a couple of options:
1. I can go for Spark and run SQL queries using sqlContext.
2. Use Hive.
As I understand it, Hive will have Spark as its underlying engine. Is that
correct?
Also, my data is JSON and is highly nested.
What do you suggest?

Thanks
Ashutosh

On Fri, Jul 22, 2016 at 7:35 AM, Gourav Sengupta 
wrote:

> If you are using EMR, please try their latest release. If you are using
> SQL, there will be very few reasons left for using Spark at all
> (particularly given that hiveContext relies heavily on Hive).
>
> Just over regular CSV data I have seen Hive on Tez give performance gains
> of 100x (querying 64 million rows x 570 columns in 2.5 minutes). With ORC
> the query is faster still (64 million rows x 570 columns in 54 seconds),
> and with proper partitioning and indexing in ORC it is blazing fast
> (64 million rows x 570 columns in 19 seconds). There is perhaps a reason
> why Spark makes things slow while using ORC :)
>
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar wrote:
>
>> It works. Would Hive give better performance in this case?
>>
>> On Thu, Jul 21, 2016 at 12:30 PM, Simone 
>> wrote:
>>
>>> If you have a folder with a bunch of JSON files inside it, yes, it
>>> should work. Just set the path to something like
>>> "path/to/your/folder/*.json".
>>> All files will be loaded into a DataFrame, and the schema will be the
>>> union of the schemas of your JSON files (this matters only if the files
>>> have different schemas).
>>> It should work - let me know.
>>>
>>> Simone Miraglia
>>> --------------
>>> From: Ashutosh Kumar
>>> Sent: 21/07/2016 08:55
>>> To: Simone; user @spark
>>> Subject: Re: Reading multiple json files from nested folders for data
>>> frame
>>>
>>> That example points to a particular JSON file. Will it work the same way
>>> if I point to the top-level folder containing all the JSON files?
>>>
>>> On Thu, Jul 21, 2016 at 12:04 PM, Simone 
>>> wrote:
>>>
>>>> Yes you can - have a look here
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>>
>>>> Hope it helps
>>>>
>>>> Simone Miraglia
>>>> --
>>>> From: Ashutosh Kumar
>>>> Sent: 21/07/2016 08:19
>>>> To: user @spark
>>>> Subject: Reading multiple json files from nested folders for data frame
>>>>
>>>> I need to read a bunch of JSON files kept in date-wise folders and
>>>> run SQL queries on them using a DataFrame. Is it possible to do so?
>>>> Please provide some pointers.
>>>>
>>>> Thanks
>>>> Ashutosh
>>>>
>>>
>>>
>>
>
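
On the highly nested JSON mentioned in this message: Spark SQL addresses fields inside nested JSON objects (structs) with dot notation, for example device.geo.city. The sketch below is plain Python, not Spark; it only lists the dotted paths found in one sample record, which are the expressions you could put in a SELECT. The record, its field names, and the view name "events" are hypothetical, and arrays are treated as leaves here (in Spark they would usually need explode).

```python
import json
from typing import Any, Iterator


def dotted_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield a dotted path for every leaf under nested JSON objects.

    Lists and scalars are treated as leaves; only dicts are descended into.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from dotted_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


# Hypothetical sample record, standing in for one line of a JSON file.
record = json.loads(
    '{"device": {"id": 7, "geo": {"city": "Pune"}}, "event_time": "2016-07-21"}'
)

columns = sorted(dotted_paths(record))
# "events" is a hypothetical temp view name registered from the DataFrame.
query = "SELECT " + ", ".join(columns) + " FROM events"
print(query)
```

The resulting string can be passed to sqlContext.sql (or spark.sql in Spark 2.0+) once the DataFrame has been registered under that view name.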


Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Gourav Sengupta
If you are using EMR, please try their latest release. If you are using
SQL, there will be very few reasons left for using Spark at all
(particularly given that hiveContext relies heavily on Hive).

Just over regular CSV data I have seen Hive on Tez give performance gains
of 100x (querying 64 million rows x 570 columns in 2.5 minutes). With ORC
the query is faster still (64 million rows x 570 columns in 54 seconds),
and with proper partitioning and indexing in ORC it is blazing fast
(64 million rows x 570 columns in 19 seconds). There is perhaps a reason
why Spark makes things slow while using ORC :)


Regards,
Gourav

On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar 
wrote:

> It works. Would Hive give better performance in this case?
>
> On Thu, Jul 21, 2016 at 12:30 PM, Simone 
> wrote:
>
>> If you have a folder with a bunch of JSON files inside it, yes, it
>> should work. Just set the path to something like
>> "path/to/your/folder/*.json".
>> All files will be loaded into a DataFrame, and the schema will be the
>> union of the schemas of your JSON files (this matters only if the files
>> have different schemas).
>> It should work - let me know.
>>
>> Simone Miraglia
>> --
>> From: Ashutosh Kumar
>> Sent: 21/07/2016 08:55
>> To: Simone; user @spark
>> Subject: Re: Reading multiple json files from nested folders for data
>> frame
>>
>> That example points to a particular JSON file. Will it work the same way
>> if I point to the top-level folder containing all the JSON files?
>>
>> On Thu, Jul 21, 2016 at 12:04 PM, Simone 
>> wrote:
>>
>>> Yes you can - have a look here
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>
>>> Hope it helps
>>>
>>> Simone Miraglia
>>> --
>>> From: Ashutosh Kumar
>>> Sent: 21/07/2016 08:19
>>> To: user @spark
>>> Subject: Reading multiple json files from nested folders for data frame
>>>
>>> I need to read a bunch of JSON files kept in date-wise folders and run
>>> SQL queries on them using a DataFrame. Is it possible to do so? Please
>>> provide some pointers.
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>


Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
It works. Would Hive give better performance in this case?

On Thu, Jul 21, 2016 at 12:30 PM, Simone  wrote:

> If you have a folder with a bunch of JSON files inside it, yes, it
> should work. Just set the path to something like
> "path/to/your/folder/*.json".
> All files will be loaded into a DataFrame, and the schema will be the
> union of the schemas of your JSON files (this matters only if the files
> have different schemas).
> It should work - let me know.
>
> Simone Miraglia
> --
> From: Ashutosh Kumar
> Sent: 21/07/2016 08:55
> To: Simone; user @spark
> Subject: Re: Reading multiple json files from nested folders for data
> frame
>
> That example points to a particular JSON file. Will it work the same way
> if I point to the top-level folder containing all the JSON files?
>
> On Thu, Jul 21, 2016 at 12:04 PM, Simone 
> wrote:
>
>> Yes you can - have a look here
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>
>> Hope it helps
>>
>> Simone Miraglia
>> --
>> From: Ashutosh Kumar
>> Sent: 21/07/2016 08:19
>> To: user @spark
>> Subject: Reading multiple json files from nested folders for data frame
>>
>> I need to read a bunch of JSON files kept in date-wise folders and run
>> SQL queries on them using a DataFrame. Is it possible to do so? Please
>> provide some pointers.
>>
>> Thanks
>> Ashutosh
>>
>
>
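
The glob approach described in the quoted reply can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the helper names json_glob and query_json_folders, the default view name, and the assumption that the date-wise folders sit exactly one level below the root are all hypothetical. It assumes Spark 2.0+ with a SparkSession passed in by the caller; on Spark 1.6 the equivalent calls would be sqlContext.read.json, registerTempTable and sqlContext.sql.

```python
def json_glob(root: str, depth: int = 1) -> str:
    """Build a glob matching *.json files exactly `depth` folder levels below root.

    depth=0 gives root/*.json (a flat folder);
    depth=1 gives root/*/*.json (one level of date-wise folders).
    """
    if depth < 0:
        raise ValueError("depth must be >= 0")
    return root.rstrip("/") + "/*" * depth + "/*.json"


def query_json_folders(spark, root, query, view="json_data", depth=1):
    """Load all matching JSON files into one DataFrame and run a SQL query on it.

    `spark` is an existing SparkSession (assumption: Spark 2.0+).
    The schema is inferred by Spark across all matched files.
    """
    df = spark.read.json(json_glob(root, depth))
    df.createOrReplaceTempView(view)  # makes the DataFrame queryable by name
    return spark.sql(query)


# Pure-Python check of the path patterns; no Spark needed for this part.
print(json_glob("path/to/your/folder", depth=0))
print(json_glob("path/to/your/folder"))
```

With a running session, a call such as query_json_folders(spark, "path/to/your/folder", "SELECT COUNT(*) FROM json_data") would then return a DataFrame with the result. The same pattern should work for gs://, s3:// or hdfs:// roots, provided the matching filesystem connector is configured on the cluster.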


Re: Reading multiple json files from nested folders for data frame

2016-07-20 Thread Ashutosh Kumar
That example points to a particular JSON file. Will it work the same way if
I point to the top-level folder containing all the JSON files?

On Thu, Jul 21, 2016 at 12:04 PM, Simone  wrote:

> Yes you can - have a look here
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> Hope it helps
>
> Simone Miraglia
> --
> From: Ashutosh Kumar
> Sent: 21/07/2016 08:19
> To: user @spark
> Subject: Reading multiple json files from nested folders for data frame
>
> I need to read a bunch of JSON files kept in date-wise folders and run
> SQL queries on them using a DataFrame. Is it possible to do so? Please
> provide some pointers.
>
> Thanks
> Ashutosh
>


Re: Reading multiple json files from nested folders for data frame

2016-07-20 Thread Ashutosh Kumar
There is no database. I read files from Google Cloud Storage/S3/HDFS.

Thanks
Ashutosh

On Thu, Jul 21, 2016 at 11:50 AM, Sree Eedupuganti  wrote:

> Which database are you using?
>