Re: How to add an accumulator for a Set in Spark

2016-03-19 Thread Adrien Mogenet
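For reference, a minimal sketch of one way to accumulate a Set on the Spark 1.x API discussed in this thread, using AccumulableParam; the identifiers (StringSetParam, the sample data) are illustrative, not taken from the thread:

    import org.apache.spark.{AccumulableParam, SparkConf, SparkContext}

    // Merges individual String elements into a Set[String] held on the driver.
    object StringSetParam extends AccumulableParam[Set[String], String] {
      def zero(initial: Set[String]): Set[String] = Set.empty[String]
      def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
      def addInPlace(a: Set[String], b: Set[String]): Set[String] = a ++ b
    }

    object SetAccumulatorExample extends App {
      val sc = new SparkContext(new SparkConf().setAppName("set-accumulator").setMaster("local[*]"))
      val seen = sc.accumulable(Set.empty[String])(StringSetParam)

      sc.parallelize(Seq("a", "b", "a", "c")).foreach(elem => seen += elem)
      println(seen.value)  // Set(a, b, c) -- only meaningful when read on the driver
      sc.stop()
    }

As with any Spark 1.x accumulator, read the value on the driver only; updates made inside transformations (rather than actions) may be re-applied when tasks are retried.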

How does Spark set task indexes?

2016-05-24 Thread Adrien Mogenet
try to go further in our understanding of how Spark behaves. We're using Spark 1.5.2, Scala 2.11, on top of Hadoop 2.6.0 -- *Adrien Mogenet* Head of Backend/Infrastructure adrien.moge...@contentsquare.com http://www.contentsquare.com 50, avenue Montaigne - 75008 Paris

Re: How does Spark set task indexes?

2016-05-25 Thread Adrien Mogenet
> On Tue, May 24, 2016 at 1:00 PM, Adrien Mogenet <adrien.moge...@contentsquare.com> wrote: >> Hi, >> I'm wondering how Spark sets the "index" of a task? >> I'm asking this question because we have a job that constantly fails a
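For reference, a sketch (Spark 1.5-era API, identifiers illustrative) that logs, from inside each task, the identifiers Spark exposes; the "Index" column in the UI normally corresponds to the partition a task processes, while attemptNumber/taskAttemptId distinguish retries:

    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    object TaskIndexExample extends App {
      val sc = new SparkContext(new SparkConf().setAppName("task-index").setMaster("local[4]"))

      sc.parallelize(1 to 100, numSlices = 8).mapPartitionsWithIndex { (partitionIdx, iter) =>
        val tc = TaskContext.get()
        // partitionId usually matches the "Index" column on the stage page;
        // attemptNumber/taskAttemptId change when the task is retried.
        println(s"partition=$partitionIdx taskAttemptId=${tc.taskAttemptId()} attempt=${tc.attemptNumber()} stage=${tc.stageId()}")
        iter
      }.count()

      sc.stop()
    }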

Unable to understand error “SparkListenerBus has already stopped! Dropping event …”

2015-09-02 Thread Adrien Mogenet
is totally non-deterministic and I can't reproduce this, probably due to the asynchronous nature and my lack of understanding of how/when stop() is supposed to be called. Any idea? Best, -- *Adrien Mogenet* Head of Backend/Infrastructure adrien.moge...@contentsquare.com (+33)6.59.16.64.22 http://www.contentsquare.com
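For reference, this warning generally means events were posted to the listener bus after SparkContext.stop() had run (often the JVM shutdown hook racing with still-queued events); a common pattern, sketched below with illustrative code rather than the job from this thread, is to stop the context explicitly once all work has finished instead of relying on the shutdown hook:

    import org.apache.spark.{SparkConf, SparkContext}

    object ExplicitStopExample extends App {
      val sc = new SparkContext(new SparkConf().setAppName("explicit-stop").setMaster("local[*]"))
      try {
        // ... all jobs and actions run here ...
        sc.parallelize(1 to 1000).count()
      } finally {
        // Stop exactly once, after everything has completed; events posted
        // after this point are the ones the listener bus warns about dropping.
        sc.stop()
      }
    }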

Re: Parquet partitioning for unique identifier

2015-09-02 Thread Adrien Mogenet
hash value and add it as a separate column, >> but it doesn't sound right to me. Are there any other ways I can try? >> >> Regards, >> -- >> Kohki Nishio >> > -- *Adrien Mogenet* Head of Backend/Infrastructure adrien.moge...@contentsquare.com (+33)6.59.16.64.22 http://www.contentsquare.com 50, avenue Montaigne - 75008 Paris
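For reference, a sketch of the approach quoted above on the Spark 1.4+ DataFrame API: derive a bounded bucket from the high-cardinality identifier and partition the Parquet output by that bucket (the df variable, the id column and the bucket count are illustrative assumptions):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    // Map each unique id onto one of 256 buckets so partitionBy produces a
    // bounded number of directories instead of one per identifier.
    val bucketOf = udf((id: String) => (id.hashCode & Int.MaxValue) % 256)

    def writeBucketed(df: DataFrame, path: String): Unit =
      df.withColumn("bucket", bucketOf(df("id")))
        .write
        .partitionBy("bucket")
        .parquet(path)

A lookup can then recompute the bucket of the id it wants and read only that directory, instead of scanning every file.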

Re: How to determine the value for spark.sql.shuffle.partitions?

2015-09-03 Thread Adrien Mogenet
average/maximum record size >> 4. cache configuration >> 5. shuffle configuration >> 6. serialization >> 7. etc? >> >> Any general best practices? >> >> Thanks! >> >> Romi K. >> > > -- *Adrien Mogenet* Head of Backend/Infrastructure adrien.moge...@contentsquare.com (+33)6.59.16.64.22 http://www.contentsquare.com 50, avenue Montaigne - 75008 Paris
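For reference, a sketch of where the setting lives on the Spark 1.x API (the value 400 is purely illustrative; the default is 200, and a rough starting point some people use is total shuffle input divided by a target partition size in the low hundreds of MB, then tuned from there):

    // Programmatically, on an existing SQLContext:
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Or at submit time, so the code stays untouched:
    //   spark-submit --conf spark.sql.shuffle.partitions=400 ...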

Split content into multiple Parquet files

2015-09-07 Thread Adrien Mogenet
with Parquet files. The only working solution so far is to persist the RDD and then loop over it N times to write N files. That does not look acceptable... Do you guys have any suggestions for doing such an operation? -- *Adrien Mogenet* Head of Backend/Infrastructure adrien.moge...@contentsquare.com
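For reference, a sketch of the single-pass alternative on the Spark 1.4+ DataFrame API, which avoids looping over the persisted RDD N times; the Event case class, the key field and the input RDD are illustrative assumptions:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    case class Event(key: String, payload: String)

    def writeByKey(sqlContext: SQLContext, events: RDD[Event]): Unit = {
      import sqlContext.implicits._
      // One folder per distinct key value, e.g. .../key=foo/part-*.parquet,
      // written in a single job instead of N filtered passes.
      events.toDF()
        .write
        .partitionBy("key")
        .parquet("/datasink/output-parquets")
    }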

Re: Split content into multiple Parquet files

2015-09-08 Thread Adrien Mogenet
saveAsHadoopFile? > > Cheng > > > On 9/8/15 2:34 PM, Adrien Mogenet wrote: > > Hi there, > > We've spent several hours trying to split our input data into several Parquet > files (or several folders, i.e. > /datasink/output-parquets//foobar.parquet), based on a > low
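For reference, the saveAsHadoopFile suggestion quoted above is usually paired with a key-aware output format; below is a sketch of that pattern for plain text output (the Parquet case in this thread would need a Parquet OutputFormat instead; class and path names are illustrative):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.rdd.RDD

    // Routes each record into a sub-directory named after its key.
    class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()  // don't repeat the key inside the written line
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
    }

    def writeByKey(pairs: RDD[(String, String)]): Unit =
      pairs.saveAsHadoopFile(
        "/datasink/output-text",
        classOf[String], classOf[String],
        classOf[KeyBasedOutputFormat])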

[POWERED BY] Please add our organization

2015-11-16 Thread Adrien Mogenet
Name: Content Square
URL: http://www.contentsquare.com
Description: We use Spark to regularly read raw data, convert them into Parquet, and process them to create advanced analytics dashboards: aggregation, sampling, statistics computations, anomaly detection, machine learning.

Re: [POWERED BY] Please add our organization

2015-12-02 Thread Adrien Mogenet
Hi folks, You're probably busy, but any update on this? :) On 16 November 2015 at 16:04, Adrien Mogenet < adrien.moge...@contentsquare.com> wrote: > Name: Content Square > URL: http://www.contentsquare.com > > Description: > We use Spark to regularly read raw data,

Re: [POWERED BY] Please add our organization

2015-12-02 Thread Adrien Mogenet
wiki. > > On Wed, Dec 2, 2015 at 10:53 AM, Adrien Mogenet > wrote: > > Hi folks, > > > > You're probably busy, but any update on this? :) > > > > > > On 16 November 2015 at 16:04, Adrien Mogenet > > wrote: > >> > >> Name

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Adrien Mogenet
>> (CoarseGrainedSchedulerBackend.scala:283)
>> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:180)
>> at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:439)
>> at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1439)
>> at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1724)
>> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
>> at org.apache.spark.SparkContext.stop(SparkContext.scala:1723)
>> at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:587)
>> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
>> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>> at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>> at scala.util.Try$.apply(Try.scala:161)
>> at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
>> at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>> at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
>> Caused by: akka.pattern.AskTimeoutException:
>> Recipient[Actor[akka://sparkDriver/user/CoarseGrainedScheduler#1432624242]] had already been terminated.
>> at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132)
>> at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.ask(AkkaRpcEnv.scala:307)
>> ... 23 more
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake
>> 800-733-2143

-- *Adrien Mogenet* Head of Backend/Infrastructure adrien.moge...@contentsquare.com (+33)6.59.16.64.22 http://www.contentsquare.com 50, avenue Montaigne - 75008 Paris
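For reference, a frequent aggravating factor behind this GC error is that each task keeps one buffering Parquet writer open per partition value it encounters; a hedged mitigation sketch (Spark 1.6+ syntax for column-based repartitioning, with illustrative column names and path) is to cluster the data by the partition columns first so each task writes only a few directories:

    import org.apache.spark.sql.DataFrame

    def writePartitioned(df: DataFrame, path: String): Unit =
      df.repartition(df("year"), df("month"))  // Spark 1.6+: a given (year, month) lands in few tasks
        .write
        .partitionBy("year", "month")
        .parquet(path)

    // Knobs people combine with this: a larger driver/executor heap, or a smaller
    // Parquet row-group buffer per writer, e.g.
    //   --conf spark.hadoop.parquet.block.size=67108864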