Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Swapnil Shinde
Great news.. thank you very much! On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote: > Awesome! > > On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji wrote: > >> Indeed! >> >> Sent from my iPhone >> Pardon the dumb thumb typos :) >> >> On Nov 8, 2018, at 11:31

Spark scala development in Sbt vs Maven

2018-03-05 Thread Swapnil Shinde
Hello, SBT's incremental compilation has been a huge plus for building Spark + Scala applications for some time. It seems Maven can also support incremental compilation with a Zinc server. Considering that, I am interested in the community's experience - 1. Spark documentation says SBT is being used
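For reference, sbt drives Zinc-based incremental compilation out of the box, while Maven reaches it through the scala-maven-plugin. A minimal build.sbt sketch for a Spark application; the version numbers below are illustrative assumptions, not recommendations:

```scala
// build.sbt -- minimal sketch; coordinates are the standard Spark artifacts,
// but the exact versions here are illustrative assumptions.
name := "spark-app"

scalaVersion := "2.11.12"

// "provided" keeps Spark itself out of the assembly jar, since the cluster
// supplies it at runtime via spark-submit.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
```

With this in place, `sbt ~compile` recompiles only the files affected by a change; on the Maven side, the scala-maven-plugin's `recompileMode` setting offers comparable incremental behavior.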

Minimum cost flow problem solving in Spark

2017-09-13 Thread Swapnil Shinde
Hello, has anyone used Spark to solve minimum cost flow problems? I am quite new to combinatorial optimization algorithms, so any help, suggestions, or library pointers are much appreciated. Thanks Swapnil

Re: Inconsistent results with combineByKey API

2017-09-05 Thread Swapnil Shinde
Ping.. Can someone please confirm whether this is an issue or not? - Swapnil On Thu, Aug 31, 2017 at 12:27 PM, Swapnil Shinde <swapnilushi...@gmail.com> wrote: > Hello All > > I am observing some strange results with the aggregateByKey API, which is > implemented with comb

Inconsistent results with combineByKey API

2017-08-31 Thread Swapnil Shinde
Hello All I am observing some strange results with the aggregateByKey API, which is implemented with combineByKey. Not sure if this is by design or a bug - I created this toy example, but the same problem can be observed on large datasets as well - case class ABC(key: Int, c1: Int, c2: Int) case class
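Without the full example it is hard to say, but the most common cause of run-to-run differences with aggregateByKey/combineByKey is a seqOp or combOp that is not associative and commutative: partition-level results are merged in a nondeterministic order. A sketch under that assumption (the case class mirrors the toy example from the post; the data and field usage are invented for illustration):

```scala
// aggregateByKey requires combOp to be associative and commutative, because
// per-partition accumulators are merged in whatever order they arrive.
case class ABC(key: Int, c1: Int, c2: Int)

val rdd = sc.parallelize(Seq(ABC(1, 2, 3), ABC(1, 5, 7), ABC(2, 1, 1)))
  .map(a => (a.key, (a.c1, a.c2)))

// Safe: sums are associative and commutative, so the result is the same
// regardless of how records land in partitions.
val sums = rdd.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v._1, acc._2 + v._2),  // seqOp: fold a value into acc
  (l, r)   => (l._1 + r._1, l._2 + r._2)       // combOp: merge partial accs
)

// Unsafe: subtraction is neither associative nor commutative, so this can
// legitimately return different results across runs or partitionings.
// val bad = rdd.aggregateByKey(0)((acc, v) => acc - v._1, (l, r) => l - r)
```

If the zero value is also mutated in place (e.g. a mutable buffer reused across keys), that produces similarly "strange" results; the API assumes the zero is logically fresh per key.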

CSV output with JOBUUID

2017-05-10 Thread Swapnil Shinde
Hello I am using spark-2.0.1 and saw that the CSV file format stores output with a JOBUUID in it. https://github.com/apache/spark/blob/v2.0.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala#L191 I want to avoid the JOBUUID in the CSV output. Is there any property

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-08 Thread Swapnil Shinde
-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>: > >> Hello all >>I have a spark job that reads parquet data and partition it based on >> one of the columns. I made sure partitions equally distributed and not >> skewed. My code looks like this - >>

Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread Swapnil Shinde
Hello all I have a Spark job that reads parquet data and partitions it based on one of the columns. I made sure partitions are equally distributed and not skewed. My code looks like this - datasetA.write.partitionBy("column1").parquet(outputPath) Execution plan - [inline image: execution plan] All
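One known pattern behind a long tail after all tasks report finished: with partitionBy, each task can write one file per distinct partition value it sees, and the final commit/rename of those many small files is sequential work on the driver side. A hedged sketch of the usual mitigation, repartitioning by the same column first so each output directory is written by roughly one task (names come from the post; whether this fits depends on the actual cardinality of column1):

```scala
import org.apache.spark.sql.functions.col

// Co-locate all rows for a given column1 value into the same task before
// writing, so the number of committed files drops from (tasks x values)
// to roughly (values). Trade-off: one task per value can reintroduce skew
// if some values are much larger than others.
datasetA
  .repartition(col("column1"))
  .write
  .partitionBy("column1")
  .parquet(outputPath)
```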

Spark shuffle: FileNotFound exception

2016-12-03 Thread Swapnil Shinde
Hello All I am facing a FileNotFoundException for a shuffle index file when running a job with large data. The same job runs fine with smaller datasets. These are my cluster specifications - No of nodes - 19 Total cores - 380 Memory per executor - 32G Spark 1.6 mapr version
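A shuffle-index FileNotFoundException that only appears at scale is often a downstream symptom of lost executors (e.g. OOM kills by the OS or resource manager) or fetch timeouts, rather than a file that was never written. A sketch of Spark 1.6-era settings that make shuffle fetches more tolerant while the root cause is investigated; the values are illustrative starting points, not tuned recommendations:

```scala
import org.apache.spark.SparkConf

// All three properties exist in Spark 1.6.
val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "10")  // default 3; retry failed fetches longer
  .set("spark.shuffle.io.retryWait", "30s")  // default 5s; pause between retries
  .set("spark.network.timeout", "600s")      // default 120s; umbrella network timeout
```

If executors are in fact dying, the executor logs (and any dmesg OOM-killer entries on the nodes) are the place to confirm it; retries only paper over that.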

Re: Dataframe broadcast join hint not working

2016-11-26 Thread Swapnil Shinde
If it is a broadcast join, > you will see it in explain. > > On Sat, Nov 26, 2016 at 10:51 AM, Swapnil Shinde <swapnilushi...@gmail.com > > wrote: > >> Hello >> I am trying a broadcast join on dataframes but it is still doing >> SortMergeJoin. I even tr

Dataframe broadcast join hint not working

2016-11-26 Thread Swapnil Shinde
Hello I am trying a broadcast join on dataframes but it is still doing a SortMergeJoin. I even tried setting spark.sql.autoBroadcastJoinThreshold higher but still no luck. Related piece of code - val c = a.join(broadcast(b), "id") On a side note, if I do SizeEstimator.estimate(b) and it
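For completeness, a minimal working form of the hint (`a` and `b` stand for the DataFrames in the post and are assumed to be defined elsewhere). Two things worth checking: the `broadcast` in scope must be the one from org.apache.spark.sql.functions, and spark.sql.autoBroadcastJoinThreshold only governs automatic broadcasts, so it should not need raising for an explicit hint:

```scala
import org.apache.spark.sql.functions.broadcast

// Explicitly mark b as the broadcast side; the size threshold config does
// not apply to an explicit hint.
val c = a.join(broadcast(b), "id")
c.explain()  // a working hint shows BroadcastHashJoin rather than SortMergeJoin
```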

No plan for broadcastHint

2015-10-02 Thread Swapnil Shinde
Hello I am trying to do an inner join with broadcastHint and am getting the exception below - I tried to increase "sqlContext.conf.autoBroadcastJoinThreshold" but still no luck. Code snippet - val dpTargetUvOutput = pweCvfMUVDist.as("a").join(broadcast(sourceAssgined.as("b")), $"a.web_id" ===

Re: Spark driver locality

2015-08-28 Thread Swapnil Shinde
if I am wrong. On Fri, Aug 28, 2015 at 1:12 AM, Swapnil Shinde swapnilushi...@gmail.com wrote: Thanks Rishitesh !! 1. I get that the driver doesn't need to be on the master, but there is a lot of communication between the driver and the cluster. That's why a co-located gateway was recommended. How much

Spark driver locality

2015-08-27 Thread Swapnil Shinde
Hello I am new to the Spark world and started exploring it recently in standalone mode. It would be great if I could get clarification on the doubts below - 1. Driver locality - It is mentioned in the documentation that client deploy-mode is not good if the machine running spark-submit is not co-located with the worker

Re: Spark driver locality

2015-08-27 Thread Swapnil Shinde
. On Thursday, August 27, 2015, Swapnil Shinde swapnilushi...@gmail.com wrote: Hello I am new to spark world and started to explore recently in standalone mode. It would be great if I get clarifications on below doubts- 1. Driver locality - It is mentioned in documentation that client deploy