Hi,
I am currently using Spark in Python. My master, worker, and driver run on
the same machine in different Docker containers. I am using Spark 1.6.
The configuration that I am using looks like this:
CONFIG["spark.executor.memory"] = "100g"
CONFIG["spark.executor.cores"] = "11"
ple.groupBy("Category").agg(sum("bookings"), sum("dealviews"))
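[For context, a minimal sketch of how this configuration and aggregation fit together in PySpark 1.6; the DataFrame name `df` and the source table `groupon_dropbox` (taken from a query later in this digest) are assumptions, since the snippet above is truncated.]

# A hedged sketch, PySpark 1.6; df/table names are assumptions.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import sum as sum_

conf = (SparkConf()
        .set("spark.executor.memory", "100g")
        .set("spark.executor.cores", "11"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.table("groupon_dropbox")  # assumed source table
result = df.groupBy("Category").agg(sum_("bookings"), sum_("dealviews"))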
Thanks for your answer.
From: James Barney <jamesbarne...@gmail.com>
Date: Tuesday, March 1, 2016 at 7:01 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Hi,
I am trying to take a sample of a SQL query to make the query run faster.
My query looks like this:
SELECT `Category` as `Category`, sum(`bookings`) as `bookings`, sum(`dealviews`)
as `dealviews` FROM groupon_dropbox WHERE `event_date` >= '2015-11-14' AND
`event_date` <= '2016-02-19' GROUP BY `Category`
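[One possible approach, sketched under the assumption that sqlContext and the groupon_dropbox table are available: DataFrame.sample(withReplacement, fraction, seed) exists in Spark 1.6 and cuts down the data scanned by the aggregation, at the cost of approximate results.]

# A hedged sketch: aggregate over a 10% sample instead of the full table.
from pyspark.sql.functions import sum as sum_

df = sqlContext.table("groupon_dropbox")
sampled = (df.filter((df.event_date >= "2015-11-14") &
                     (df.event_date <= "2016-02-19"))
             .sample(False, 0.1, 42))  # withReplacement=False, fraction, seed
result = sampled.groupBy("Category").agg(sum_("bookings"), sum_("dealviews"))

[Note that sums computed on a sample need to be scaled back up by roughly 1/fraction to approximate the true totals.]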
Hi,
I am trying to add columns to a table that I created with the "saveAsTable" API.
I update the columns using sqlContext.sql('alter table myTable add columns
(mycol string)').
The next time I create a df and save it into the same table with the new
columns, I get a:
"ParquetRelation requires …"
Are there any options that will allow me not to move TBs of data every day?
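[One option worth checking, as a sketch only: Spark 1.6 can merge Parquet schemas at read time, so new files can carry the extra column without rewriting the old data. The warehouse path below is hypothetical; a saveAsTable table's files live wherever the metastore put them.]

# Hedged sketch: append files that carry the new column, then read with
# schema merging enabled so old and new files reconcile. Path is hypothetical.
df_new.write.mode("append").parquet("/warehouse/myTable")
merged = sqlContext.read.option("mergeSchema", "true").parquet("/warehouse/myTable")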
Thanks for your answer.
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 10, 2016 at 3:41 AM
To: maurin lenglart <mau...@cuberonlabs.com>
http://talebzadehmich.wordpress.com/
On 10 April 2016 at 19:34, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
Hi,
So basically you are telling me that I need to recreate the table and re-insert
everything every time I update the schema?
I will try that during the next weekend.
Thank you for your answers.
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 10, 2016 at 11:54 PM
To: maurin lenglart <mau...@cuberonlabs.com>
* First I create the df
* Then I use df.insertInto myTable
I also migrated from Parquet to ORC; I am not sure if this has an impact or not.
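[For reference, a hedged sketch of the flow described above in Spark 1.6; the actual query is elided in the thread, and df.write.insertInto is the non-deprecated form of the call.]

# Sketch of the described flow: build the df, then append it into the
# existing table. Note insertInto matches columns by position, not by name.
df = sqlContext.sql("SELECT ...")  # actual query elided in the thread
df.write.insertInto("myTable")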
Thank you for your help.
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 10, 2016 at 11:54 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Hi,
I am executing one query:
"SELECT `event_date` as `event_date`, sum(`bookings`) as
`bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
'2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"
My table was created with something like:
CREATE
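[The CREATE statement is truncated above; a hypothetical ORC-backed definition consistent with the columns in the query, given the thread's subject, might look like the following. Column types are assumptions.]

# Hypothetical reconstruction; requires a HiveContext (the default sqlContext
# on CDH), since STORED AS ORC is Hive DDL.
sqlContext.sql("""
    CREATE TABLE myTable (event_date STRING, bookings BIGINT, dealviews BIGINT)
    STORED AS ORC
""")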
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:22 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 4:14 AM
To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation
…GROUP BY `event_date` LIMIT 2") takes 8 seconds.
thanks
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:52 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
I am using the latest release of Cloudera and I didn't modify any version. Do
you think that I should try to manually update Hive?
thanks
From: Jörn Franke <jornfra...@gmail.com>
Date: Saturday, April 16, 2016 at 1:02 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Thank you for your answer.
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Hi,
I am doing a SQL query that returns a DataFrame. Then I am writing the result
of the query using "df.write", but the result gets written in a lot of small
files (~100 files of 200 KB each). So now I am doing a ".coalesce(2)" before
the write.
But the number "2" that I picked is static; is there a way to choose it
dynamically?
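[One hedged way to pick the factor dynamically instead of hard-coding 2: derive it from an estimated output size and a target size per file. Spark 1.6 has no built-in way to know the output size up front, so estimated_size_bytes and the output path below are assumptions that would have to come from elsewhere, e.g. the input size.]

# Hedged sketch: choose the number of output files from an assumed size
# estimate, targeting ~128 MB per file.
TARGET_FILE_BYTES = 128 * 1024 * 1024
num_files = max(1, estimated_size_bytes // TARGET_FILE_BYTES)  # size assumed known
df.coalesce(num_files).write.parquet("/out/myResult")  # hypothetical path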
Same here
From: Benjamin Kim
Date: Wednesday, July 13, 2016 at 11:47 AM
To: manish ranjan
Cc: user
Subject: Re: Spark Website
It takes me to the directories instead of the webpage.
On Jul 13, 2016, at 11:45 AM, manish ranjan wrote:
Hi,
Is there a way to estimate the size of a DataFrame in Python?
Something similar to
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html
?
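[There is no official PySpark equivalent in 1.x as far as I know, but one unofficial workaround is to reach through py4j to that JVM-side SizeEstimator on the DataFrame's underlying Java object. This relies on private attributes (_jvm, _jdf), so treat it as a fragile sketch, not a supported API.]

# Unofficial sketch: estimate the in-memory size of the JVM-side object.
# _jvm and _jdf are private PySpark internals and may change between versions.
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
print(size_bytes)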
thanks
Hi,
I am trying to load a JSON file compressed as .tar.bz2, but Spark throws an error.
I am using PySpark with Spark 1.6.2 (Cloudera 5.9).
What would be the best way to handle that?
I don't want to have a non-Spark job that just uncompresses the data…
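[One possible Spark-only approach, sketched with a hypothetical input path: .tar.bz2 archives are not splittable by Spark's text readers, but sc.binaryFiles can hand each whole archive to Python's tarfile module inside the job. Note each archive is held in memory on one executor.]

import io
import json
import tarfile

def extract_json_records(path_and_bytes):
    # Unpack one .tar.bz2 archive in memory; yield one record per JSON line.
    _, data = path_and_bytes
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as archive:
        for member in archive.getmembers():
            f = archive.extractfile(member)
            if f is None:  # skip directories and special entries
                continue
            for raw in f.read().splitlines():
                raw = raw.strip()
                if raw:
                    yield json.loads(raw.decode("utf-8"))

records = sc.binaryFiles("/data/events/*.tar.bz2").flatMap(extract_json_records)
df = sqlContext.createDataFrame(records)  # path above is hypothetical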
thanks
Hi,
We are trying to build a Spark Streaming solution that subscribes to and
pushes to Kafka.
But we are running into a problem of duplicate events.
Right now, I am doing a "foreachRDD", looping over the messages of each
partition, and sending those messages to Kafka.
Is there any good way of solving this?
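[For reference, a hedged sketch of the foreachRDD pattern described, assuming `stream` is the DStream and using the kafka-python producer (an assumption; the thread doesn't name the library). Attaching a deterministic key per event at least lets a downstream consumer de-duplicate on replays, which matches the advice in the reply below.]

from kafka import KafkaProducer  # assumption: kafka-python client

def send_partition(messages):
    # One producer per partition, reused for all of that partition's messages.
    producer = KafkaProducer(bootstrap_servers="broker:9092")  # hypothetical broker
    for msg in messages:
        # event_id/payload are hypothetical fields; the key enables de-duplication.
        producer.send("events",
                      key=msg["event_id"].encode("utf-8"),
                      value=msg["payload"].encode("utf-8"))
    producer.flush()

stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))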
> …ically be possible to handle this in Spark, but you'll probably have a
> better time handling duplicates in the service that reads from Kafka.
>
> On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart <mau...@cuberonlabs.com>
> wrote:
>>
>> Hi