Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Stéphane Verlet
I had that issue too and from what I gathered, it is an expected optimization... Try using repartition instead. On Feb 3, 2021, 11:55, James Yu wrote: >Hi Team, > >We are running into this poor performance issue and seeking your >suggestion on how to improve
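A rough sketch of the difference being suggested (the DataFrame name and output path are placeholders, not from the thread): coalesce(1) narrows the plan without a shuffle, so upstream work can collapse into a single task, while repartition(1) adds a shuffle and keeps the upstream stages parallel.

  // coalesce(1) avoids a shuffle, so upstream computation can also end up in one task
  df.coalesce(1).write.parquet("/tmp/out")

  // repartition(1) inserts a shuffle: upstream stages stay parallel,
  // only the final write happens in a single task
  df.repartition(1).write.parquet("/tmp/out")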

Re: Java Rdd of String to dataframe

2017-10-12 Thread Stéphane Verlet
you can specify the schema programmatically https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Wed, Oct 11, 2017 at 3:35 PM, sk skk wrote: > Can we create a dataframe from a Java pair rdd of String . I don’t have a >
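A minimal sketch of the programmatic-schema approach the link describes, written in Scala (the pair RDD, SparkSession name `spark`, and column names are placeholders, not from the thread):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructType, StructField, StringType}

  // rdd: RDD[(String, String)] -- stands in for the pair RDD from the question
  val rowRdd = rdd.map { case (k, v) => Row(k, v) }
  val schema = StructType(Seq(
    StructField("key", StringType, nullable = true),
    StructField("value", StringType, nullable = true)))
  val df = spark.createDataFrame(rowRdd, schema)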

Re: Spark job taking 10s to allocate executors and memory before submitting job

2017-09-28 Thread Stéphane Verlet
Sounds like such a small job; if you are running it on a cluster, have you considered simply running it locally (master = local)? On Wed, Sep 27, 2017 at 7:06 AM, navneet sharma wrote: > Hi, > > I am running spark job taking total 18s, in that 8 seconds for actual >
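For reference, only the master URL needs to change for a local run (the class and jar names here are placeholders):

  spark-submit --master "local[*]" --class com.example.MyJob my-job.jar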

Re: Running Hive and Spark together with Dynamic Resource Allocation

2016-10-28 Thread Stéphane Verlet
This works for us:
  yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
  yarn.nodemanager.aux-services.mapreduce_shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
  yarn.nodemanager.aux-services.spark_shuffle.class
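The archived message is cut off before the spark_shuffle class value. As a sketch, the properties would sit in yarn-site.xml roughly as below; the YarnShuffleService class name is the usual value for Spark's external shuffle service on YARN and is assumed here rather than taken from the thread:

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>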

Re: rdd split into new rdd

2015-12-23 Thread Stéphane Verlet
You should be able to do that using mapPartitions On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu wrote: > bq. {a=1, b=1, c=2, d=2} > > Can you elaborate your criteria a bit more? The above seems to be a Set, > not a Map. > > Cheers > > On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya

Re: rdd split into new rdd

2015-12-23 Thread Stéphane Verlet
How can I use mapPartitions? Could you give me an example? > > 2015-12-23 17:26 GMT+02:00 Stéphane Verlet <kaweahsoluti...@gmail.com>: > >> You should be able to do that using mapPartitions >> >> On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu <yuzhih...@gmail.com>
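Since the thread asks for an example: a minimal mapPartitions sketch. The RDD contents and the split criterion are assumed from the {a=1, b=1, c=2, d=2} fragment quoted above; the original requirement is only partly visible in the archive.

  // pairs: RDD[(String, Int)], e.g. ("a",1), ("b",1), ("c",2), ("d",2)
  // mapPartitions transforms a whole partition at once through its iterator
  val tagged = pairs.mapPartitions { iter =>
    iter.map { case (k, v) => (v, k) }        // regroup each element by its value
  }
  val ones = tagged.filter(_._1 == 1).values  // RDD containing a, b
  val twos = tagged.filter(_._1 == 2).values  // RDD containing c, d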

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
ad pool. So not sure why > killing the app in spark UI doesn't kill the process launched via script > > > On Friday, November 20, 2015, Stéphane Verlet <kaweahsoluti...@gmail.com> > wrote: > >> I solved the first issue by adding a shutdown hook in my code. The >> shu

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I solved the first issue by adding a shutdown hook in my code. The shutdown hook gets called when you exit your script (Ctrl-C, kill … but not kill -9): val shutdownHook = scala.sys.addShutdownHook { try { sparkContext.stop() // Make sure to kill any other threads or thread pools you may be
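The quoted code is cut off in the archive; a self-contained sketch of the same idea follows (the sparkContext value and the cleanup body are assumptions, not the original code):

  // Registers a JVM shutdown hook: runs on normal exit, Ctrl-C or a plain kill,
  // but not on kill -9
  val shutdownHook = scala.sys.addShutdownHook {
    try {
      sparkContext.stop()
      // also stop any thread pools or background threads the job started,
      // otherwise the JVM may linger after the SparkContext is gone
    } catch {
      case _: Exception => () // already shutting down, nothing more to do
    }
  }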

Re: PairRDD from SQL

2015-11-04 Thread Stéphane Verlet
sqlContext.sql().map(row=> ((row.getString(0), row.getString(1)),row.getInt(2))) On Wed, Nov 4, 2015 at 1:44 PM, pratik khadloya wrote: > Hello, > > Is it possible to have a pair RDD from the below SQL query. > The pair being ((item_id, flight_id), metric1) > > item_id,
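A slightly fuller sketch of that one-liner (the query text, column order and types are placeholders; the thread's actual table is not shown):

  // run the query, then key each row by (item_id, flight_id) with metric1 as the value
  val pairRdd = sqlContext
    .sql("SELECT item_id, flight_id, metric1 FROM some_table")
    .map(row => ((row.getString(0), row.getString(1)), row.getInt(2)))

On Spark 1.x a DataFrame's map returns an RDD directly, as in the thread; on 2.x you would go through .rdd first.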

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread Stéphane Verlet
From your pseudo code, it would be sequential and done twice: 1+2+3, then 1+2+4. If you do a .cache() in step 2, then you would have 1+2+3, then 4. I ran several steps in parallel from the same program, but never using the same source RDD, so I do not know the limitations there. I simply started
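A sketch of the caching point, plus one common way to launch the two downstream actions concurrently from the driver. The step names and predicates are placeholders; whether the jobs actually overlap depends on free executor capacity.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  val step2 = step1.map(transform).cache()   // steps 1+2 computed once, then reused

  // each Future submits an independent Spark job; the scheduler can run them concurrently
  val job3 = Future { step2.filter(pred3).count() }
  val job4 = Future { step2.filter(pred4).count() }
  val count3 = Await.result(job3, Duration.Inf)
  val count4 = Await.result(job4, Duration.Inf)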

Re: SQL query in scala API

2014-12-04 Thread Stéphane Verlet
Disclaimer: I am new at Spark. I did something similar in a prototype which works, but I did not test it at scale yet. val agg = users.mapValues(_ => 1).aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp) class CustomAggregation() extends
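The snippet is cut off at the class definition. A self-contained sketch of the same pattern follows; the accumulator field and the aggregation logic are invented for illustration, only the method shape matches the quoted code:

  // simple mutable accumulator used as the zero value for aggregateByKey
  class CustomAggregation(var count: Long = 0L) extends Serializable

  object CustomAggregation {
    // folds one value from the RDD into the per-partition accumulator
    def sequenceOp(agg: CustomAggregation, value: Int): CustomAggregation = {
      agg.count += value; agg
    }
    // merges two per-partition accumulators
    def comboOp(a: CustomAggregation, b: CustomAggregation): CustomAggregation = {
      a.count += b.count; a
    }
  }

  // users: RDD[(String, User)] -- placeholder
  val agg = users.mapValues(_ => 1)
    .aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp)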

Re: Spark 1.1.0 Can not read snappy compressed sequence file

2014-12-04 Thread Stéphane Verlet
Yes, it is working with this in spark-env.sh:
  export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
  export
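Not from the thread, but a quick way to confirm the native Snappy library is actually visible on a node (the checknative subcommand is part of the standard Hadoop CLI in recent 2.x releases):

  hadoop checknative -a
  # prints each native library (zlib, snappy, lz4, ...) with true/false and its path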

Spark 1.1.0 Can not read snappy compressed sequence file

2014-11-07 Thread Stéphane Verlet
I first saw this using SparkSQL but the result is the same with plain Spark.
  14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
  java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
  at