Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Sorry, sent early, wasn't finished typing.

CREATE EXTERNAL TABLE 

Then we can select the data using Impala.  But this is registered as an
external table and must be refreshed if new data is inserted.
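Roughly, the manual steps look like the following sketch (the column list and
the exact Parquet clause are made up for this mail and depend on the
Hive/Impala versions; the real DDL is longer):

// Hand-run DDL on the Hive/Impala side, held in a Scala string here purely
// for reference; the columns and the STORED AS clause are hypothetical.
val createExternalTable = """
  CREATE EXTERNAL TABLE ParqTable (id INT, payload STRING)
  STORED AS PARQUET
  LOCATION 'hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt'
"""

// And then, after each batch that Spark inserts, in impala-shell:
//   REFRESH ParqTable;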

Obviously this doesn't seem good and doesn't seem like the correct solution.

How should we insert data from SparkSQL into a Parquet table which can be
directly queried by Impala?

Best regards,
Patrick


On 1 August 2014 16:18, Patrick McGloin  wrote:

> Hi,
>
> We would like to use Spark SQL to store data in Parquet format and then
> query that data using Impala.
>
> We've tried to come up with a solution and it is working but it doesn't
> seem good.  So I was wondering if you guys could tell us what is the
> correct way to do this.  We are using Spark 1.0 and Impala 1.3.1.
>
> First we are registering our tables using SparkSQL:
>
> val sqlContext = new SQLContext(sc)
> sqlContext.createParquetFile[ParqTable]("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt",
> true)
>
> Then we are using the HiveContext to register the table and do the insert:
>
> val hiveContext = new HiveContext(sc)
> import hiveContext._
>
> hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt").registerAsTable("ParqTable")
> eventsDStream.foreachRDD(event=>event.insertInto("ParqTable"))
>
> Now we have the data stored in a Parquet file.  To access it in Hive or
> Impala we run
>
>
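(For completeness on the snippet quoted above: createParquetFile[ParqTable]
takes its schema from a Scala case class, and the import hiveContext._ line
brings in the implicit conversion that lets an RDD of that class be used with
insertInto inside foreachRDD.  Ours looks roughly like this, with made-up
field names:)

// Hypothetical stand-in for our real schema class; the actual one has more
// columns.  createParquetFile[ParqTable] derives the Parquet schema from
// these fields, and the DStream's RDDs contain instances of the same class.
case class ParqTable(id: Int, payload: String, eventTime: Long)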


Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Michael Armbrust
So is the only issue that impala does not see changes until you refresh the
table?  This sounds like a configuration that needs to be changed on the
impala side.


On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin 
wrote:

> Sorry, sent early, wasn't finished typing.
>
> CREATE EXTERNAL TABLE 
>
> Then we can select the data using Impala.  But this is registered as an
> external table and must be refreshed if new data is inserted.
>
> Obviously this doesn't seem good and doesn't seem like the correct
> solution.
>
> How should we insert data from SparkSQL into a Parquet table which can be
> directly queried by Impala?
>
> Best regards,
> Patrick
>
>
> On 1 August 2014 16:18, Patrick McGloin  wrote:
>
>> Hi,
>>
>> We would like to use Spark SQL to store data in Parquet format and then
>> query that data using Impala.
>>
>> We've tried to come up with a solution and it is working but it doesn't
>> seem good.  So I was wondering if you guys could tell us what is the
>> correct way to do this.  We are using Spark 1.0 and Impala 1.3.1.
>>
>> First we are registering our tables using SparkSQL:
>>
>> val sqlContext = new SQLContext(sc)
>> sqlContext.createParquetFile[ParqTable]("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt",
>> true)
>>
>> Then we are using the HiveContext to register the table and do the insert:
>>
>> val hiveContext = new HiveContext(sc)
>> import hiveContext._
>>
>> hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt").registerAsTable("ParqTable")
>> eventsDStream.foreachRDD(event=>event.insertInto("ParqTable"))
>>
>> Now we have the data stored in a Parquet file.  To access it in Hive or
>> Impala we run
>>
>>
>


Re: Spark SQL, Parquet and Impala

2014-08-02 Thread Patrick McGloin
Hi Michael,

Thanks for your reply.  Is this the correct way to load data from Spark
into Parquet?  Somehow it doesn't feel right.  When we followed the steps
described for storing the data into Hive tables everything was smooth, we
used HiveContext and the table is automatically recognised by Hive (and
Impala).
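That route was essentially the following kind of thing (a sketch with made-up
table and column names, not our actual job):

// Hedged sketch of the HiveContext-only path: the table is created through
// the Hive metastore itself, so there is no separate SQLContext step and no
// hand-written CREATE EXTERNAL TABLE outside Spark.  Assumes sc and
// eventsDStream from the earlier mail, with the DStream elements being a
// case class matching the columns below.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS events_hive (id INT, payload STRING)")

// Same streaming insert as before, just targeting the Hive-managed table.
eventsDStream.foreachRDD(rdd => rdd.insertInto("events_hive"))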

When we loaded the data into Parquet using the method I described we used
both SQLContext and HiveContext.  We had to manually define the table using
the CREATE EXTERNAL in Hive.  Then we have to refresh to see changes.

So the problem isn't just the refresh, it's that we're unsure of the best
practice for loading data into Parquet tables.  Is the way we are doing the
Spark part correct in your opinion?

Best regards,
Patrick






On 1 August 2014 19:32, Michael Armbrust  wrote:

> So is the only issue that impala does not see changes until you refresh
> the table?  This sounds like a configuration that needs to be changed on
> the impala side.
>
>
> On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin  wrote:
>
>> Sorry, sent early, wasn't finished typing.
>>
>> CREATE EXTERNAL TABLE 
>>
>> Then we can select the data using Impala.  But this is registered as an
>> external table and must be refreshed if new data is inserted.
>>
>> Obviously this doesn't seem good and doesn't seem like the correct
>> solution.
>>
>> How should we insert data from SparkSQL into a Parquet table which can be
>> directly queried by Impala?
>>
>> Best regards,
>> Patrick
>>
>>
>> On 1 August 2014 16:18, Patrick McGloin 
>> wrote:
>>
>>> Hi,
>>>
>>> We would like to use Spark SQL to store data in Parquet format and then
>>> query that data using Impala.
>>>
>>> We've tried to come up with a solution and it is working but it doesn't
>>> seem good.  So I was wondering if you guys could tell us what is the
>>> correct way to do this.  We are using Spark 1.0 and Impala 1.3.1.
>>>
>>> First we are registering our tables using SparkSQL:
>>>
>>> val sqlContext = new SQLContext(sc)
>>> sqlContext.createParquetFile[ParqTable]("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt",
>>> true)
>>>
>>> Then we are using the HiveContext to register the table and do the
>>> insert:
>>>
>>> val hiveContext = new HiveContext(sc)
>>> import hiveContext._
>>>
>>> hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt").registerAsTable("ParqTable")
>>> eventsDStream.foreachRDD(event=>event.insertInto("ParqTable"))
>>>
>>> Now we have the data stored in a Parquet file.  To access it in Hive or
>>> Impala we run
>>>
>>>
>>
>


RE: Spark SQL, Parquet and Impala

2014-08-02 Thread Andrew Lee
Hi Patrick,
In Impala 1.3.1, when you update tables and metadata, do you still need to run
'invalidate metadata' in impala-shell?  My understanding is that Impala uses a
pull architecture to refresh the metadata cached on the catalogd, but I'm not
sure whether that applies here, since you are updating the Hive metastore
yourself when creating the external tables.

If 'invalidate metadata' is still required, I would call this an Impala
problem, since HiveContext is passive and only acts when and where the command
is invoked.  The underlying driver is still HiveServer2 talking to Hive (I
haven't looked into the Spark code, so I'm not sure whether it uses the
ql.Driver class, but I'm assuming it is HS2 here), and Impala has to fetch the
metadata from the Hive metastore.  HiveContext should update the Hive
metastore when you create the table, but that doesn't mean it will trigger
Impala's catalogd to pull in the latest metadata, which is cached on the
catalogd.

This is probably not so much a Parquet-related answer as background on how
Impala works with Hive and how Spark updates data in Hive.
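For reference, the two statements that have come up in this thread are Impala
SQL run from impala-shell (or any Impala client), not anything Spark executes;
spelled out below as plain strings, with the table name taken from your
example:

// Impala-side statements only, held in Scala vals here just so they can sit
// next to the Spark code in this thread; they are not issued through Spark.
// REFRESH re-scans the data files of a table Impala already knows about
// (e.g. after Spark writes new Parquet files), while INVALIDATE METADATA
// drops the metadata cached on the catalogd so it is reloaded from the Hive
// metastore (e.g. after a table is created or changed through Hive).
val refreshStmt    = "REFRESH ParqTable"
val invalidateStmt = "INVALIDATE METADATA ParqTable"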
AL

Date: Sat, 2 Aug 2014 10:30:27 +0200
Subject: Re: Spark SQL, Parquet and Impala
From: mcgloin.patr...@gmail.com
To: user@spark.apache.org

Hi Michael,
Thanks for your reply.  Is this the correct way to load data from Spark into 
Parquet?  Somehow it doesn't feel right.  When we followed the steps described 
for storing the data into Hive tables everything was smooth, we used 
HiveContext and the table is automatically recognised by Hive (and Impala).

When we loaded the data into Parquet using the method I described we used both 
SQLContext and HiveContext.  We had to manually define the table using the 
CREATE EXTERNAL in Hive.  Then we have to refresh to see changes.

So the problem isn't just the refresh, it's that we're unsure of the best 
practice for loading data into Parquet tables.  Is the way we are doing the 
Spark part correct in your opinion?

Best regards,
Patrick


On 1 August 2014 19:32, Michael Armbrust  wrote:

So is the only issue that impala does not see changes until you refresh the
table?  This sounds like a configuration that needs to be changed on the
impala side.


On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin  wrote:

Sorry, sent early, wasn't finished typing.

CREATE EXTERNAL TABLE 

Then we can select the data using Impala.  But this is registered as an
external table and must be refreshed if new data is inserted.

Obviously this doesn't seem good and doesn't seem like the correct solution.

How should we insert data from SparkSQL into a Parquet table which can be
directly queried by Impala?

Best regards,
Patrick


On 1 August 2014 16:18, Patrick McGloin  wrote:

Hi,

We would like to use Spark SQL to store data in Parquet format and then query
that data using Impala.

We've tried to come up with a solution and it is working but it doesn't seem
good.  So I was wondering if you guys could tell us what is the correct way to
do this.  We are using Spark 1.0 and Impala 1.3.1.

First we are registering our tables using SparkSQL:

val sqlContext = new SQLContext(sc)
sqlContext.createParquetFile[ParqTable]("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt", true)

Then we are using the HiveContext to register the table and do the insert:

val hiveContext = new HiveContext(sc)
import hiveContext._
hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt").registerAsTable("ParqTable")
eventsDStream.foreachRDD(event=>event.insertInto("ParqTable"))

Now we have the data stored in a Parquet file.  To access it in Hive or Impala
we run