Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ovidiu-Cristian MARCU
> On Sep 1, 2016, at 7:35 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote: >> Thank you, I like and agree with your point

Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ovidiu-Cristian MARCU
Thank you, I like and agree with your point. RDDs evolved into Datasets by means of an optimizer. I just wonder what the use cases for RDDs are (other than the current version of GraphX leveraging RDDs)? Best, Ovidiu > On 01 Sep 2016, at 16:26, Sean Owen wrote: > > Here's my

Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Ovidiu-Cristian MARCU
The yellow warning message is probably even more confusing than not receiving an answer/opinion on his post. Best, Ovidiu > On 08 Aug 2016, at 20:10, Sean Owen wrote: > > I also don't know what's going on with the "This post has NOT been > accepted by the mailing list

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
Interesting opinion, thank you. Still, according to the websites, Parquet is basically inspired by Dremel (Google) [1], and parts of ORC have been enhanced while deployed at Facebook and Yahoo [2]. Other than this presentation [3], do you know of any other benchmark?

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
So did you actually try to run your use case with Spark 2.0 and ORC files? It’s hard to understand your ‘apparently..’. Best, Ovidiu > On 26 Jul 2016, at 13:10, Gourav Sengupta wrote: > > If you have ever tried to use ORC via SPARK you will know that SPARK's >
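For anyone actually trying ORC under Spark 2.0, one detail worth checking is predicate pushdown, which at that point was disabled by default for ORC while enabled for Parquet. A minimal, illustrative spark-defaults.conf fragment (its relevance to the specific use case above is an assumption, not a confirmed fix):

```properties
# Predicate pushdown: off by default for ORC in Spark 2.0,
# on by default for Parquet.
spark.sql.orc.filterPushdown      true
spark.sql.parquet.filterPushdown  true
```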

Re: Strategies for propery load-balanced partitioning

2016-06-03 Thread Ovidiu-Cristian MARCU
I suppose you are running on 1.6. I guess you need a solution based on the [1], [2] features coming in 2.0. [1] https://issues.apache.org/jira/browse/SPARK-12538 / https://issues.apache.org/jira/browse/SPARK-12394
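As a side note on why properly balanced partitioning is hard, here is a small plain-Python sketch (not the Spark API) of how hash partitioning a skewed key distribution leaves one partition holding almost all records, the kind of imbalance the repartitioning work in [1] and [2] is meant to help with:

```python
# Illustration only: hash-partitioning skewed keys produces unbalanced
# partition sizes, since every copy of a hot key lands in one partition.
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# A skewed distribution: one "hot" key accounts for 90% of the records.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
sizes = partition_sizes(keys, 4)
# The partition holding "hot" receives at least 900 of the 1000 records,
# while the remaining partitions share the rest.
print(sizes)
```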

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ovidiu-Cristian MARCU
Hi Ted, Any chance you could expand on the SQLConf parameters, with more explanation of what changing these settings does? Not all of them are made clear in the descriptions. Thanks! Best, Ovidiu > On 31 May 2016, at 16:30, Ted Yu wrote: > > Maciej: > You can refer

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Ovidiu-Cristian MARCU
Is Spark in relation to Tez something like a Flink runner for Apache Beam? The use case for Tez may nevertheless be interesting (but is the current implementation YARN-based only?). Spark is efficient (or faster) for a number of reasons, including its ‘in-memory’ execution (from my understanding and experiments).

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Ovidiu-Cristian MARCU
I forgot to add the link to the “Technical Vision” paper so there it > is - > https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing > > From: "Sela, Amit" <ans...@paypal.com> > Date: Saturday,

Re: Spark.default.parallelism can not set reduce number

2016-05-20 Thread Ovidiu-Cristian MARCU
You can check org.apache.spark.sql.internal.SQLConf for other default settings as well: val SHUFFLE_PARTITIONS = SQLConfigBuilder("spark.sql.shuffle.partitions") .doc("The default number of partitions to use when shuffling data for joins or aggregations.") .intConf
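For readers landing on this thread: the setting quoted above can be changed without touching code, e.g. in spark-defaults.conf or via --conf on spark-submit. The value 400 below is purely illustrative; the default is 200:

```properties
# Number of partitions used when shuffling data for joins or aggregations.
spark.sql.shuffle.partitions  400
```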

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
h-streaming-102> > On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU > <ovidiu-cristian.ma...@inria.fr> wrote: > > Hi, > > We can see in [2] many interesting (and expected!) improvements (promises) > like extended SQL support, unified API (DataFrames, DataSets), impro

What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Hi, We can see in [2] many interesting (and expected!) improvements (promises) like extended SQL support, unified API (DataFrames, DataSets), improved engine (Tungsten relates to ideas from modern compilers and MPP databases - similar to Flink [3]), structured streaming etc. It seems we

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
> Dr Mich Talebzadeh > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
t they weren’t acknowledged. > > From: Ovidiu-Cristian MARCU > Sent: Sunday, April 17, 2016 7:48 AM > To: andy petrella > Cc: Mich Talebzadeh; Ascot Moss >

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
You have probably read this benchmark at Yahoo; any comments from the Spark side? https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at > On 17 Apr 2016, at 12:41, andy

Re: Graphx

2016-03-11 Thread Ovidiu-Cristian MARCU
Hi, I wonder what version of Spark and what parameter configuration you used. I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters). John: I suppose your C++ app (algorithm) does not scale if you used

Re: off-heap certain operations

2016-02-16 Thread Ovidiu-Cristian MARCU
a > developer to know whether to use, and if you're a developer and > curious, you can just grep the code for this flag, and/or read into > what Tungsten does. > > Personally, I would leave this off. > > On Fri, Feb 12, 2016 at 6:10 PM, Ovidiu-Cristian MARCU > <ovidiu

Lost executors failed job unable to execute spark examples Triangle Count (Analytics triangles)

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi, I am able to run the Triangle Count example with some smaller graphs, but when I use http://snap.stanford.edu/data/com-Friendster.html I am not able to get the job to finish successfully. For some reason Spark loses its executors. No matter what
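A hedged aside for anyone hitting the same wall: lost executors on a graph this size are often memory-related (YARN kills executors that exceed their memory limits), so settings along these lines are a common first thing to try. The values below are guesses for illustration, not a known fix for com-Friendster:

```properties
# More heap per executor, more off-heap headroom for the YARN container,
# and a longer network timeout to survive long GC pauses.
spark.executor.memory               20g
spark.yarn.executor.memoryOverhead  4096
spark.network.timeout               600s
```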

spark examples Analytics ConnectedComponents - keep running, nothing in output

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi, I’m trying to run Analytics cc (ConnectedComponents) but it keeps running without ever ending. The logs look fine, but I just keep getting “Job xyz finished, reduce took some time”: ... INFO DAGScheduler: Job 29 finished: reduce at VertexRDDImpl.scala:90, took 14.828033 s INFO DAGScheduler: Job 30
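For context, the labeling that GraphX’s ConnectedComponents converges to (each vertex ends up with the minimum vertex id in its component) can be illustrated with a tiny single-machine union-find; this is pure Python, not the GraphX code:

```python
# Plain-Python union-find computing the same labeling GraphX's
# ConnectedComponents converges to: every vertex gets the minimum
# vertex id reachable from it.
def connected_components(num_vertices, edges):
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Attach the larger root under the smaller one, so the root of
            # each set is always its minimum id (GraphX's convention).
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv
    return [find(v) for v in range(num_vertices)]

# Two components: {0, 1, 2} labeled 0 and {3, 4} labeled 3.
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```

The distributed Pregel version exists only to compute this same fixpoint on graphs that do not fit on one machine, iterating until no label changes.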

Re: off-heap certain operations

2016-02-12 Thread Ovidiu-Cristian MARCU
I found nothing about these “certain operations”. It is still not clear; “certain” is poor documentation. Can someone give an answer so I can decide whether to use this new release? spark.memory.offHeap.enabled: If true, Spark will attempt to use off-heap memory for certain operations. > On 12 Feb 2016, at 13:21,

off-heap certain operations

2016-02-11 Thread Ovidiu-Cristian MARCU
Hi, Reading through the latest documentation on memory management I can see that the parameter spark.memory.offHeap.enabled (false by default) is described with ‘If true, Spark will attempt to use off-heap memory for certain operations’ [1]. Can you please describe the certain operations you
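In case it helps anyone experimenting with this flag: enabling it also requires sizing the off-heap region, so the two settings are normally set together. The size below is an illustrative value, not a recommendation:

```properties
# spark.memory.offHeap.size must be positive when off-heap use is enabled.
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     2g
```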