Smarter views

2017-05-25 Thread Maurin Lenglart
Hi, I am working on some big data related technology, and I am trying to get a sense of how hard it would be to enhance views in Impala so that whenever someone queries a view, not all the columns of that view are computed but only the columns needed for that particular query. A simple example

Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
ically be possible to handle this in Spark but you'll probably have a > better time handling duplicates in the service that reads from Kafka. > > On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart <mau...@cuberonlabs.com> > wrote: >> >> Hi

Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi, we are trying to build a spark streaming solution that subscribes to and pushes to kafka. But we are running into the problem of duplicate events. Right now, I am doing a “forEachRdd” and looping over the messages of each partition and sending those messages to kafka. Is there any good way of solving
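
A minimal sketch of the pattern described above (foreachRDD with a per-partition producer), assuming the kafka-python client is installed on the executors; the broker address, topic name, and the DStream of strings are placeholders:

    from kafka import KafkaProducer  # kafka-python, assumed to be available on the executors

    def send_partition(messages):
        # one producer per partition, so nothing non-serializable crosses the driver/executor boundary
        producer = KafkaProducer(bootstrap_servers="broker:9092")  # placeholder broker
        for msg in messages:
            producer.send("events", msg.encode("utf-8"))  # "events" is a placeholder topic
        producer.flush()

    def send_rdd(rdd):
        rdd.foreachPartition(send_partition)

    # dstream is the input DStream built elsewhere in the job
    dstream.foreachRDD(send_rdd)

As the reply above notes, this on its own is still at-least-once delivery, so duplicates would have to be handled by whatever reads from Kafka.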

.tar.bz2 in spark

2016-12-08 Thread Maurin Lenglart
Hi, I am trying to load a json file compressed in .tar.bz2 but spark throws an error. I am using pyspark with spark 1.6.2. (Cloudera 5.9) What would be the best way to handle that? I don’t want to have a non-spark job that just uncompresses the data… thanks
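
One way to keep this inside Spark is to read each archive whole with binaryFiles and unpack it with Python's tarfile module. A sketch, assuming a placeholder input path and archives containing newline-delimited JSON files:

    import io
    import json
    import tarfile

    def extract_json_records(pair):
        _, content = pair  # (path, raw bytes of the whole archive)
        records = []
        with tarfile.open(fileobj=io.BytesIO(content), mode="r:bz2") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                text = tar.extractfile(member).read().decode("utf-8")
                for line in text.splitlines():
                    if line.strip():
                        records.append(json.loads(line))
        return records

    # binaryFiles keeps each .tar.bz2 intact, so tarfile can read it in one piece
    rdd = sc.binaryFiles("/data/*.tar.bz2").flatMap(extract_json_records)
    df = sqlContext.createDataFrame(rdd)

The trade-off is that each archive is read as a single value, so it has to fit in an executor's memory.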

SizeEstimator for python

2016-08-15 Thread Maurin Lenglart
Hi, Is there a way to estimate the size of a dataframe in python? Something similar to https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html ? thanks
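
There is no direct pyspark equivalent of SizeEstimator, but a rough, sample-based estimate of the serialized size can be computed in Python. A sketch, with the sampling fraction as an assumption:

    import pickle

    def estimate_df_size(df, fraction=0.01):
        # pickle a sample of rows and extrapolate; this approximates serialized size,
        # not the JVM in-memory size that SizeEstimator reports
        sampled = df.sample(False, fraction)
        sampled_bytes = sampled.rdd.map(lambda row: len(pickle.dumps(row.asDict()))).sum()
        return int(sampled_bytes / fraction)

    print(estimate_df_size(df))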

dynamic coalesce to pick file size

2016-07-26 Thread Maurin Lenglart
Hi, I am doing a Sql query that returns a Dataframe. Then I am writing the result of the query using “df.write”, but the result gets written in a lot of small files (~100 files of about 200 KB each). So now I am doing a “.coalesce(2)” before the write. But the number “2” that I picked is static, is
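
A sketch of picking the coalesce count dynamically from an estimated output size; the 128 MB target per file, the row-size sampling, and the output path are assumptions:

    TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target size per output file

    row_count = df.count()
    # estimate the average row size from a small collected sample
    sample_rows = df.limit(1000).collect()
    avg_row_bytes = sum(len(str(r)) for r in sample_rows) / float(max(1, len(sample_rows)))
    estimated_bytes = row_count * avg_row_bytes

    num_files = max(1, int(estimated_bytes // TARGET_FILE_BYTES) + 1)
    df.coalesce(num_files).write.parquet("/tmp/output")  # placeholder path

The estimate is rough, since the on-disk columnar, compressed size differs from the string representation of a row.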

Re: Spark Website

2016-07-13 Thread Maurin Lenglart
Same here From: Benjamin Kim Date: Wednesday, July 13, 2016 at 11:47 AM To: manish ranjan Cc: user Subject: Re: Spark Website It takes me to the directories instead of the webpage. On Jul 13, 2016, at 11:45 AM, manish ranjan

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
_date` LIMIT 2”) take 8 seconds. thanks From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 17, 2016 at 2:52 PM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark"

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
gmail.com> Date: Sunday, April 17, 2016 at 2:22 PM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: orc vs parquet aggrega

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
ich.talebza...@gmail.com> Date: Saturday, April 16, 2016 at 4:14 AM To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org> Subject: Re: orc vs parquet aggregation, o

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
using the latest release of cloudera and I didn’t modify any version. Do you think that I should try to manually update hive? thanks From: Jörn Franke <jornfra...@gmail.com> Date: Saturday, April 16, 2016 at 1:02 AM To: maurin lenglart <mau...@

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
k you for your answer. From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Saturday, April 16, 2016 at 12:32 AM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark" <user@spark.apache.

orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi, I am executing one query: “SELECT `event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2” My table was created something like: CREATE
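
A sketch of how the comparison in this thread could be reproduced from pyspark, timing the same aggregation against a Parquet and an ORC copy of the table; the table names are placeholders:

    import time

    df = sqlContext.table("myTable")
    df.write.format("parquet").saveAsTable("myTable_parquet")  # placeholder names
    df.write.format("orc").saveAsTable("myTable_orc")

    query = ("SELECT `event_date`, sum(`bookings`) AS bookings, sum(`dealviews`) AS dealviews "
             "FROM {table} WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' "
             "GROUP BY `event_date` LIMIT 2")

    for table in ("myTable_parquet", "myTable_orc"):
        start = time.time()
        sqlContext.sql(query.format(table=table)).collect()
        print("{0}: {1:.1f}s".format(table, time.time() - start))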

Re: alter table add columns aternatives or hive refresh

2016-04-15 Thread Maurin Lenglart
he df * Then I use df.insertInto myTable I also migrated from parquet to ORC, not sure if this has an impact or not. Thank you for your help. From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 10, 2016 at 11:54 PM To: mau

Re: alter table add columns aternatives or hive refresh

2016-04-11 Thread Maurin Lenglart
I will try that during the next weekend. Thank you for your answers. From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 10, 2016 at 11:54 PM To: maurin lenglart <mau...@cuberonlabs.com>

Re: alter table add columns aternatives or hive refresh

2016-04-10 Thread Maurin Lenglart
bzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> On 10 April 2016 at 19:34, Maurin Lenglart <mau...@cuberonlabs.com> wrote: Hi, So basically you are telling me that I need to recreate a table, and re-insert everything every time I upda

Re: alter table add columns aternatives or hive refresh

2016-04-10 Thread Maurin Lenglart
options that will allow me not to move TBs of data every day? Thanks for your answer From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 10, 2016 at 3:41 AM To: maurin lenglart <mau...@cuberonla

alter table add columns aternatives or hive refresh

2016-04-09 Thread Maurin Lenglart
Hi, I am trying to add columns to a table that I created with the “saveAsTable” api. I update the columns using sqlContext.sql(‘alter table myTable add columns (mycol string)’). The next time I create a df and save it into the same table with the new columns, I get a: “ParquetRelation requires
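
As the replies in this thread suggest, one workaround is to rebuild the table with the new schema and re-insert the existing data. A sketch of that approach, with placeholder table and column names:

    from pyspark.sql.functions import lit

    old_df = sqlContext.table("myTable")
    # backfill the new column so the old rows match the new schema
    migrated = old_df.withColumn("mycol", lit(None).cast("string"))

    migrated.write.mode("overwrite").saveAsTable("myTable_new")  # placeholder name
    # later batches that already contain "mycol" can then use insertInto("myTable_new")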

Re: Sample sql query using pyspark

2016-03-01 Thread Maurin Lenglart
ple.groupBy("Category").agg(sum("bookings"), sum("dealviews")) Thanks for your answer. From: James Barney <jamesbarne...@gmail.com> Date: Tuesday, March 1, 2016 at 7:01 AM To: maurin lenglart <mau...@cuberonlabs.com>

Sample sql query using pyspark

2016-03-01 Thread Maurin Lenglart
Hi, I am trying to get a sample of a sql query in order to make the query run faster. My query looks like this: SELECT `Category` as `Category`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM groupon_dropbox WHERE `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19' GROUP
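
A sketch of the sampling approach that comes up in the reply above: filter, sample a fraction of the rows, then aggregate. The 10% fraction and the seed are assumptions:

    from pyspark.sql.functions import sum as sql_sum

    df = sqlContext.table("groupon_dropbox") \
        .filter("`event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19'")

    sample = df.sample(withReplacement=False, fraction=0.1, seed=42)
    result = (sample.groupBy("Category")
                    .agg(sql_sum("bookings").alias("bookings"),
                         sql_sum("dealviews").alias("dealviews")))
    # sums computed on a 10% sample are roughly 10% of the true totals, so scale them back up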

_metada file throwing an "GC overhead limit exceeded" after a write

2016-02-12 Thread Maurin Lenglart
Hi, I am currently using spark in python. I have my master, worker and driver on the same machine in different docker containers. I am using spark 1.6. The configuration that I am using looks like this: CONFIG["spark.executor.memory"] = "100g" CONFIG["spark.executor.cores"] = "11"
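
A sketch of how a config dict like the one quoted above might be applied when building the context; the two values are copied from the message, and the app name is a placeholder:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    CONFIG = {
        "spark.executor.memory": "100g",
        "spark.executor.cores": "11",
    }

    conf = SparkConf().setAppName("metadata-write")  # placeholder app name
    for key, value in CONFIG.items():
        conf = conf.set(key, value)

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)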