Thanks for the response. I am using Google Cloud, and I have a couple of options:

1. Use Spark and run SQL queries through sqlContext.
2. Use Hive. As I understand it, Hive would have Spark as its underlying engine. Is that correct?

Also, my data is JSON and is highly nested. What do you suggest?

Thanks
Ashutosh

On Fri, Jul 22, 2016 at 7:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> If you are using EMR, please try their latest release. If you are using
> SQL, there will be very few reasons left for using Spark at all
> (particularly given that hiveContext relies heavily on Hive).
>
> On regular CSV data alone I have seen Hive on Tez give performance gains
> of 100x (querying 64 million rows x 570 columns in 2.5 minutes). With ORC
> it is faster still (64 million rows x 570 columns in 54 seconds), and with
> proper partitioning and indexing in ORC it is blazing fast (64 million
> rows x 570 columns in 19 seconds). There is perhaps a reason why Spark
> makes things slow while using ORC :)
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar <kmr.ashutos...@gmail.com>
> wrote:
>
>> It works. Is it better to use Hive in this case for better performance?
>>
>> On Thu, Jul 21, 2016 at 12:30 PM, Simone <simone.mirag...@gmail.com>
>> wrote:
>>
>>> If you have a folder with a bunch of JSON files inside it, then yes, it
>>> should work. Just set the path to something like
>>> "path/to/your/folder/*.json".
>>> All files will be loaded into a dataframe, and the schema will be the
>>> union of the schemas of your JSON files (this only matters if the files
>>> have different schemas).
>>> It should work - let me know.
>>>
>>> Simone Miraglia
>>> ------------------------------
>>> From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>> Sent: 21/07/2016 08:55
>>> To: Simone <simone.mirag...@gmail.com>; user @spark
>>> <user@spark.apache.org>
>>> Subject: Re: Reading multiple json files form nested folders for data
>>> frame
>>>
>>> That example points to a particular JSON file. Will it work the same way
>>> if I point to the top-level folder containing all the JSON files?
>>>
>>> On Thu, Jul 21, 2016 at 12:04 PM, Simone <simone.mirag...@gmail.com>
>>> wrote:
>>>
>>>> Yes you can - have a look here:
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>>
>>>> Hope it helps
>>>>
>>>> Simone Miraglia
>>>> ------------------------------
>>>> From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>>> Sent: 21/07/2016 08:19
>>>> To: user @spark <user@spark.apache.org>
>>>> Subject: Reading multiple json files form nested folders for data frame
>>>>
>>>> I need to read a bunch of JSON files kept in date-wise folders and
>>>> perform SQL queries on them using a data frame. Is it possible to do so?
>>>> Please provide some pointers.
>>>>
>>>> Thanks
>>>> Ashutosh
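
For reference, the approach discussed in this thread (point the JSON reader at a glob that covers the date-wise folders, then query the resulting data frame with SQL) could be sketched roughly as below. This is only a sketch under assumptions: it assumes a Spark 2.x SparkSession is passed in as `spark`, that the JSON files sit a fixed number of folders below the base path, and that a view called "events" is acceptable. The helper names `json_glob` and `query_json_tree`, the `depth` parameter, and the view name are made up for illustration and are not part of Spark.

```python
def json_glob(base_dir, depth=1):
    """Build a glob like 'base/*/*.json' for JSON files `depth` folders below base_dir.

    The folder layout is an assumption; adjust `depth` to match the real tree.
    """
    parts = [base_dir.rstrip("/")] + ["*"] * depth + ["*.json"]
    return "/".join(parts)


def query_json_tree(spark, base_dir, sql, view_name="events", depth=1):
    """Load all matching JSON files into one DataFrame and run a SQL query on it.

    `spark` is expected to be a pyspark.sql.SparkSession (Spark 2.x).
    """
    # The reader accepts glob patterns; the schema is inferred across all files.
    df = spark.read.json(json_glob(base_dir, depth))
    # Register the DataFrame under a name so it can be referenced from SQL.
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)


# Hypothetical usage, assuming pyspark is installed and the path exists:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("json-tree").getOrCreate()
#   query_json_tree(spark, "path/to/your/folder",
#                   "SELECT COUNT(*) FROM events").show()
```

On Spark 1.6 the equivalent calls would be `sqlContext.read.json(...)`, `df.registerTempTable(...)` and `sqlContext.sql(...)`. Passing the session in as an argument keeps the helpers independent of how the session is created.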