Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ovidiu-Cristian MARCU
Thank you! 
The talk is indeed very good.

Best,
Ovidiu
> On 01 Sep 2016, at 16:47, Jules Damji <ju...@databricks.com> wrote:
> 
> Sean succinctly put the nuanced differences and the evolution of Datasets. 
> Simply put, structure, to some extent, limits you - and that is what 
> DataFrames & Datasets, among other things, offer. 
> 
> When you want low-level control, or are dealing with unstructured data, blobs of 
> text or images, then RDDs make sense.
> 
> There's an illuminating talk by Michael Armbrust, Structuring Spark: 
> DataFrames & Datasets, where he makes an eloquent case for their merits & 
> motivation, while also elaborating on RDDs. 
> 
> https://youtu.be/1a4pgYzeFwE <https://youtu.be/1a4pgYzeFwE>
> 
> Cheers 
> 
> Jules 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)
> 
> 
> 
> On Sep 1, 2016, at 7:35 AM, Ovidiu-Cristian MARCU 
> <ovidiu-cristian.ma...@inria.fr <mailto:ovidiu-cristian.ma...@inria.fr>> 
> wrote:
> 
>> Thank you, I like and agree with your point. RDD evolved to Datasets by 
>> means of an optimizer.
>> I just wonder what are the use cases for RDDs (other than current version of 
>> GraphX leveraging RDDs)?
>> 
>> Best,
>> Ovidiu
>> 
>>> On 01 Sep 2016, at 16:26, Sean Owen <so...@cloudera.com 
>>> <mailto:so...@cloudera.com>> wrote:
>>> 
>>> Here's my paraphrase:
>>> 
>>> Datasets are really the new RDDs. They have a similar nature
>>> (container of strongly-typed objects) but bring some optimizations via
>>> Encoders for common types.
>>> 
>>> DataFrames are different from RDDs and Datasets; they neither replace
>>> nor are replaced by them. They're fundamentally for tabular data, not
>>> arbitrary objects, and thus support SQL-like operations that only
>>> make sense on tabular data.
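
A minimal sketch of that distinction (hypothetical Person records, assuming the 
Spark 2.0 SparkSession API):

case class Person(name: String, age: Int)   // hypothetical strongly-typed record

val spark = org.apache.spark.sql.SparkSession.builder().appName("ds-vs-df").getOrCreate()
import spark.implicits._

// RDD: arbitrary JVM objects, no optimizer involved.
val rdd = spark.sparkContext.parallelize(Seq(Person("a", 30), Person("b", 40)))

// Dataset[Person]: still strongly-typed objects, but backed by an Encoder
// and planned through Catalyst/Tungsten.
val ds = rdd.toDS()
ds.filter(_.age > 35).show()

// DataFrame (= Dataset[Row]): untyped, tabular, SQL-style operations.
val df = ds.toDF()
df.groupBy("age").count().show()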
>>> 
>>> On Thu, Sep 1, 2016 at 3:17 PM, Ashok Kumar
>>> <ashok34...@yahoo.com.invalid <mailto:ashok34...@yahoo.com.invalid>> wrote:
>>>> Hi,
>>>> 
>>>> What are the practical differences between the new Dataset in Spark 2 and the
>>>> existing DataFrame?
>>>> 
>>>> Has Dataset replaced DataFrame, and what advantages does it have if I use
>>>> Dataset instead of DataFrame?
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 



Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ovidiu-Cristian MARCU
Thank you, I like and agree with your point. RDD evolved to Datasets by means 
of an optimizer.
I just wonder what are the use cases for RDDs (other than current version of 
GraphX leveraging RDDs)?

Best,
Ovidiu

> On 01 Sep 2016, at 16:26, Sean Owen  wrote:
> 
> Here's my paraphrase:
> 
> Datasets are really the new RDDs. They have a similar nature
> (container of strongly-typed objects) but bring some optimizations via
> Encoders for common types.
> 
> DataFrames are different from RDDs and Datasets; they neither replace
> nor are replaced by them. They're fundamentally for tabular data, not
> arbitrary objects, and thus support SQL-like operations that only
> make sense on tabular data.
> 
> On Thu, Sep 1, 2016 at 3:17 PM, Ashok Kumar
>  wrote:
>> Hi,
>> 
>> What are the practical differences between the new Dataset in Spark 2 and the
>> existing DataFrame?
>> 
>> Has Dataset replaced DataFrame, and what advantages does it have if I use
>> Dataset instead of DataFrame?
>> 
>> Thanks
>> 
>> 
> 
> 





Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Ovidiu-Cristian MARCU
The yellow warning message is probably even more confusing than not 
receiving an answer/opinion on his post.

Best,
Ovidiu
> On 08 Aug 2016, at 20:10, Sean Owen  wrote:
> 
> I also don't know what's going on with the "This post has NOT been
> accepted by the mailing list yet" message, because actually the
> messages always do post. In fact this has been sent to the list 4
> times:
> 
> https://www.mail-archive.com/search?l=user%40spark.apache.org=dueckm=0=0
> 
> On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann  wrote:
>> 
>> 
>> 
>> 
>> 
>> On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de" 
>>  wrote:
>> 
>>> Hello,
>>> 
>>> I write to you because I am not really sure whether I did everything right 
>>> when registering and subscribing to the spark user list.
>>> 
>>> I posted the appended question to Spark User list after subscribing and 
>>> receiving the "WELCOME to user@spark.apache.org" mail from 
>>> "user-h...@spark.apache.org".
>>> But this post is still in state "This post has NOT been accepted by the 
>>> mailing list yet.".
>>> 
>>> Is this because I forgot something to do or did something wrong with my 
>>> user account (dueckm)? Or is it because no member of the Spark User List 
>>> reacted to that post yet?
>>> 
>>> Thanks a lot for your help.
>>> 
>>> Matthias
>>> 
>>> Fiducia & GAD IT AG | www.fiduciagad.de
>>> AG Frankfurt a. M. HRB 102381 | Registered office: Hahnstr. 48, 60528 
>>> Frankfurt a. M. | VAT ID DE 143582320
>>> Management Board: Klaus-Peter Bruns (Chairman), Claus-Dieter Toben (Deputy 
>>> Chairman),
>>> 
>>> Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten 
>>> Pfläging, Jörg Staff
>>> Chairman of the Supervisory Board: Jürgen Brinkmann
>>> 
>>> - Forwarded by Matthias Dück/M/FAG/FIDUCIA/DE on 08.08.2016 10:57 
>>> -
>>> 
>>> From: dueckm 
>>> To: user@spark.apache.org
>>> Date: 04.08.2016 13:27
>>> Subject: Are join/groupBy operations with wide Java Beans using Dataset API 
>>> much slower than using RDD API?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Hello,
>>> 
>>> I built a prototype that uses join and groupBy operations via Spark RDD API.
>>> Recently I migrated it to the Dataset API. Now it runs much slower than with
>>> the original RDD implementation.
>>> Did I do something wrong here? Or is this a price I have to pay for the more
>>> convenient API?
>>> Is there a known solution to deal with this effect (e.g. configuration via
>>> "spark.sql.shuffle.partitions" - but how could I determine the correct
>>> value)?
>>> In my prototype I use Java Beans with a lot of attributes. Does this slow
>>> down Spark operations with Datasets?
>>> 
>>> Here I have a simple example that shows the difference:
>>> JoinGroupByTest.zip
>>> 
>>> - I build 2 RDDs and join and group them. Afterwards I count and display the
>>> joined RDDs.  (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD() )
>>> - When I do the same actions with Datasets it takes approximately 40 times
>>> as long (Method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
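
A rough sketch of the kind of setup being described, against the Spark 2.0 API (the 
case classes and values below are made up, not the original prototype; 
spark.sql.shuffle.partitions is the setting mentioned above):

import org.apache.spark.sql.SparkSession

// Hypothetical stand-ins for the wide Java Beans in the prototype.
case class Order(customerId: Int, amount: Double)
case class Customer(customerId: Int, name: String)

val spark = SparkSession.builder().appName("join-groupby-tuning").getOrCreate()
import spark.implicits._

// The default of 200 shuffle partitions creates many tiny tasks (and empty
// output files) for small data; pick a value closer to the real parallelism.
spark.conf.set("spark.sql.shuffle.partitions", "16")

val orders    = spark.sparkContext.parallelize(1 to 100000).map(i => Order(i % 1000, i * 1.0)).toDS()
val customers = spark.sparkContext.parallelize(1 to 1000).map(i => Customer(i, s"c$i")).toDS()

// joinWith keeps the typed objects; the join/groupBy shuffles honor the setting above.
val grouped = orders
  .joinWith(customers, orders("customerId") === customers("customerId"))
  .groupByKey { case (o, _) => o.customerId }
  .count()
grouped.show()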
>>> 
>>> Thank you very much for your help.
>>> Matthias
>>> 
>>> PS1: excuse me for sending this post more than once, but I am new to this
>>> mailing list and probably did something wrong when registering/subscribing,
>>> so my previous postings have not been accepted ...
>>> 
>>> PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
>>> RDD implementation, jobs 2/3 to Dataset):
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> 
>> 
>> 
>> 
> 
> 





Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
Interesting opinion, thank you

Still, according to the websites, Parquet is basically inspired by Dremel (Google) [1], and 
parts of ORC have been enhanced while deployed at Facebook and Yahoo [2].

Other than this presentation [3], do you guys know any other benchmark?

[1] https://parquet.apache.org/documentation/latest/
[2] https://orc.apache.org/docs/
[3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
> 
> when parquet came out it was developed by a community of companies, and was 
> designed as a library to be supported by multiple big data projects. nice
> 
> orc on the other hand initially only supported hive. it wasn't even designed 
> as a library that can be re-used. even today it brings in the kitchen sink of 
> transitive dependencies. yikes
> 
> 
> On Jul 26, 2016 5:09 AM, "Jörn Franke"  > wrote:
> I think both are very similar, but with slightly different goals. While they 
> work transparently for each Hadoop application, you need to enable specific 
> support in the application for predicate push-down. 
> In the end you have to check which application you are using and do some 
> tests (with correct predicate push-down configuration). Keep in mind that 
> both formats work best if they are sorted on filter columns (which is your 
> responsibility) and if their optimizations are correctly configured (min/max 
> index, bloom filter, compression, etc.). 
> 
> If you need to ingest sensor data you may want to store it first in hbase and 
> then batch process it in large files in Orc or parquet format.
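
As a hedged sketch of what that advice looks like from Spark (paths and column names 
are hypothetical; spark.sql.parquet.filterPushdown and spark.sql.orc.filterPushdown 
are the relevant flags as far as I know, and ORC support may require a Hive-enabled 
build on older versions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-vs-parquet").getOrCreate()

// Enable predicate push-down for both formats (Parquet's is usually on by default).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

val events = spark.read.json("events.json")          // hypothetical input

// Sorting on the filter column before writing helps min/max indexes and
// row-group/stripe skipping in both formats.
events.sort("eventTime").write.parquet("events_parquet")
events.sort("eventTime").write.orc("events_orc")

// Filters on the sorted column can then be pushed down to the readers.
spark.read.parquet("events_parquet").filter("eventTime > '2016-07-01'").count()
spark.read.orc("events_orc").filter("eventTime > '2016-07-01'").count()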
> 
> On 26 Jul 2016, at 04:09, janardhan shetty  > wrote:
> 
>> Just wondering advantages and disadvantages to convert data into ORC or 
>> Parquet. 
>> 
>> In the documentation of Spark there are numerous examples of Parquet format. 
>> 
>> Any strong reasons to choose Parquet over the ORC file format?
>> 
>> Also : current data compression is bzip2
>> 
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>  
>> 
>>  
>> This seems biased.



Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
So did you actually try to run your use case with Spark 2.0 and ORC files?
It’s hard to understand your ‘apparently..’.

Best,
Ovidiu
> On 26 Jul 2016, at 13:10, Gourav Sengupta  wrote:
> 
> If you have ever tried to use ORC via SPARK you will know that SPARK's 
> promise of accessing ORC files is a sham. SPARK cannot access partitioned 
> ORC tables via HiveContext, SPARK cannot stripe through ORC faster, 
> and what's more, if you are using SQL and have thought of using HIVE with ORC 
> on TEZ, then it runs way better, faster and leaner than SPARK. 
> 
> I can process almost a few billion records close to a terabyte in a cluster 
> with around 100GB RAM and 40 cores in a few hours, and find it a challenge 
> doing the same with SPARK. 
> 
> But apparently, everything is resolved in SPARK 2.0.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor  > wrote:
> One additional point specific to Spark 2.0 - for the alpha Structured 
> Streaming API (only),  the file sink only supports Parquet format (I'm sure 
> that limitation will be lifted in a future release before Structured 
> Streaming is GA):
>  "File sink - Stores the output to a directory. As of Spark 2.0, this 
> only supports Parquet file format, and Append output mode."
>  
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
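> 
A minimal sketch of that file sink in use (hypothetical paths and schema, assuming 
the Spark 2.0 structured streaming API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("ss-parquet-sink").getOrCreate()

// Hypothetical source: a directory of incoming JSON files with a fixed schema.
val schema = new StructType().add("id", "long").add("ts", "timestamp")
val input = spark.readStream.schema(schema).json("incoming/")

// As of 2.0, the file sink only supports the Parquet format and Append mode.
val query = input.writeStream
  .format("parquet")
  .option("path", "out/parquet")
  .option("checkpointLocation", "out/checkpoints")
  .outputMode("append")
  .start()

query.awaitTermination()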
>  
> 
> 
> ​
> 



Re: Strategies for propery load-balanced partitioning

2016-06-03 Thread Ovidiu-Cristian MARCU
I suppose you are running on 1.6.
I guess you need some solution based on [1], [2] features which are coming in 
2.0.

[1] https://issues.apache.org/jira/browse/SPARK-12538 
 / 
https://issues.apache.org/jira/browse/SPARK-12394 

[2] https://issues.apache.org/jira/browse/SPARK-12849 


However, I did not check for examples; I would like to add to your question and 
ask the community to link to some examples of the recent improvements/changes.

It could help, however, to give a concrete example of your specific problem, as you 
may be hitting some stragglers, probably caused by data skew.

Best,
Ovidiu


> On 03 Jun 2016, at 17:31, saif.a.ell...@wellsfargo.com wrote:
> 
> Hello everyone!
>  
> I was noticing that, when reading parquet files or actually any kind of 
> source data frame data (spark-csv, etc.), default partitioning is not fair.
> Action tasks usually run very fast on some partitions and very slow on some 
> others, and frequently run fast on all but the last partition (which looks like 
> it reads more than 50% of the input data size).
>  
> I notice that each task is loading some portion of the data, say 1024MB 
> chunks, with some tasks loading 20+GB of data.
>  
> Applying repartition strategies solves this issue properly and general 
> performance is increased considerably, but for very large dataframes, 
> repartitioning is a costly process.
>  
> In short, what are the available strategies or configurations that help 
> reading from disk or HDFS with proper executor data distribution?
>  
> If this needs to be more specific, I am strictly focused on Parquet files from 
> HDFS. I know there are some MIN
>  
> Really appreciate,
> Saif
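
For what it's worth, a rough sketch of how one might diagnose and then address 
this (Spark 1.6-style API, hypothetical path and partition count; it does not avoid 
the shuffle cost Saif mentions):

// Assuming a sqlContext in scope (e.g. spark-shell) and a hypothetical HDFS path.
val df = sqlContext.read.parquet("hdfs:///data/big_table")

// Rows per input partition - large outliers indicate unbalanced input splits.
df.rdd.mapPartitions(rows => Iterator(rows.size)).collect().foreach(println)

// A full shuffle: costly, but the result is evenly distributed.
val balanced = df.repartition(512)
balanced.rdd.mapPartitions(rows => Iterator(rows.size)).collect().foreach(println)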



Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ovidiu-Cristian MARCU
Hi Ted,

Any chance to elaborate more on the SQLConf parameters, in the sense of having more 
explanation of when to change these settings?
Not all of them are made clear in the descriptions.

Thanks!

Best,
Ovidiu
> On 31 May 2016, at 16:30, Ted Yu  wrote:
> 
> Maciej:
> You can refer to the doc in 
> sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala for these 
> parameters.
> 
> On Tue, May 31, 2016 at 7:27 AM, Takeshi Yamamuro  > wrote:
> If you don't mind the newest version, you can try to use v2.0-preview.
> http://spark.apache.org/news/spark-2.0.0-preview.html 
> 
> 
> There, you can control the number of input partitions without shuffles via the 
> two parameters below:
> spark.sql.files.maxPartitionBytes
> spark.sql.files.openCostInBytes
> ( Not documented though, 
> 
> // maropu
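
For reference, a minimal sketch of those two options against the 2.0-preview file 
sources (the byte values are illustrative only):

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// Target roughly 32MB per input partition, and treat each small file as costing
// ~4MB to open, so many tiny files get packed together instead of producing
// one tiny partition each. Note: these apply to the SQL/Dataset file sources,
// not to sc.textFile.
spark.conf.set("spark.sql.files.maxPartitionBytes", (32 * 1024 * 1024).toString)
spark.conf.set("spark.sql.files.openCostInBytes", (4 * 1024 * 1024).toString)

val df = spark.read.csv("hdfs:///proj/dFAB_test/testdata/perf_test1.csv")
println(df.rdd.getNumPartitions)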
> 
> On Tue, May 31, 2016 at 11:08 PM, Maciej Sokołowski  > wrote:
> After setting shuffle to true I get the expected 128 partitions, but I'm worried 
> about the performance of such a solution - especially since I see that some shuffling is 
> done, because the sizes of the partitions change:
> 
> scala> sc.textFile("hdfs:///proj/dFAB_test/testdata/perf_test1.csv", 
> minPartitions=128).coalesce(128, true).mapPartitions{rows => 
> Iterator(rows.length)}.collect()
> res3: Array[Int] = Array(768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 828, 896, 896, 896, 896, 896, 896, 
> 896, 896, 896, 896, 896, 896, 850, 786, 768, 768, 768, 768, 768, 768, 768, 
> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768)
> 
> I use spark 1.6.0
> 
> 
> On 31 May 2016 at 16:02, Ted Yu  > wrote:
> Value for shuffle is false by default.
> 
> Have you tried setting it to true ?
> 
> Which Spark release are you using ?
> 
> On Tue, May 31, 2016 at 6:13 AM, Maciej Sokołowski  > wrote:
> Hello Spark users and developers.
> 
> I read a file and want to ensure that it has an exact number of partitions, for 
> example 128.
> 
> In documentation I found:
> 
> def textFile(path: String, minPartitions: Int = defaultMinPartitions): 
> RDD[String]
> 
> But the argument here is the minimal number of partitions, so I use coalesce to 
> ensure the desired number of partitions:
> 
> def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: 
> Ordering[T] = null): RDD[T]
> //Return a new RDD that is reduced into numPartitions partitions.
> 
> So I combine them and get number of partitions lower than expected:
> 
> scala> sc.textFile("perf_test1.csv", 
> minPartitions=128).coalesce(128).getNumPartitions
> res14: Int = 126
> 
> Is this expected behaviour? File contains 10 lines, size of partitions 
> before and after coalesce:
> 
> scala> sc.textFile("perf_test1.csv", minPartitions=128).mapPartitions{rows => 
> Iterator(rows.length)}.collect()
> res16: Array[Int] = Array(782, 781, 782, 781, 781, 782, 781, 781, 781, 781, 
> 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 
> 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 
> 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 
> 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 
> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 
> 782, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 
> 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 
> 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781)
> 
> scala> sc.textFile("perf_test1.csv", 
> minPartitions=128).coalesce(128).mapPartitions{rows => 
> Iterator(rows.length)}.collect()
> res15: Array[Int] = Array(1563, 781, 781, 781, 782, 781, 781, 781, 781, 782, 
> 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781, 782, 
> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 
> 782, 781, 781, 781, 781, 1563, 782, 781, 781, 782, 781, 781, 781, 781, 782, 
> 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 
> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 
> 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 
> 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 
> 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782)
> 
> So two partitions are 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Ovidiu-Cristian MARCU
Spark in relation to Tez can be like a Flink runner for Apache Beam? The use 
case of Tez however may be interesting (but current implementation only 
YARN-based?)

Spark is efficient (or faster) for a number of reasons, including its 
‘in-memory’ execution (from my understanding and experiments). If one really 
cares to dive in, it is enough to read their papers, which explain very well the 
optimization framework (graph-specific, MPP DB, Catalyst, ML pipelines, etc.) 
that Spark became after the initial RDD implementation.

What Spark is missing is a way of reaching its users with a good ‘production’ 
level, good documentation and feedback from the masters of this unique 
piece.

Just an opinion.

Best,
Ovidiu


> On 30 May 2016, at 21:49, Mich Talebzadeh  wrote:
> 
> yep, Hortonworks supports Tez for one reason or another, which I am hopefully 
> going to test as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also, if you are independent then you are better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive, but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 30 May 2016 at 20:19, Michael Segel  > wrote:
> Mich, 
> 
> Most people use vendor releases because they need to have the support. 
> Hortonworks is the vendor who has the most skin in the game when it comes to 
> Tez. 
> 
> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
> Then LLAP is the in-memory piece to speed up Tez? 
> 
> HTH
> 
> -Mike
> 
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh > > wrote:
>> 
>> thanks, I think the problem is that the TEZ user group is exceptionally 
>> quiet. Just sent an email to the Hive user group to see if anyone has managed to 
>> build a vendor-independent version.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 21:23, Jörn Franke > > wrote:
>> Well I think it is different from MR. It has some optimizations which you do 
>> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
>> 
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
>> integrated in the Hortonworks distribution. 
>> 
>> 
>> On 29 May 2016, at 21:43, Mich Talebzadeh > > wrote:
>> 
>>> Hi Jorn,
>>> 
>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
>>> the TEZ user group kindly gave a hand but I could not get very far (or maybe I 
>>> did not make enough effort) making it work.
>>> 
>>> That TEZ user group is very quiet as well.
>>> 
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>>> in-memory capability.
>>> 
>>> It would be interesting to see what version of TEZ works as execution 
>>> engine with Hive.
>>> 
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>>> Hive etc as I am sure you already know.
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 29 May 2016 at 20:19, Jörn Franke >> > wrote:
>>> Very interesting do you plan also a test with TEZ?
>>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh >> > wrote:
>>> 
 Hi,
 
 I did another study of Hive using Spark engine compared to Hive with MR.
 
 Basically took the original table imported using Sqoop and created and 
 populated a new ORC table partitioned by year and month into 48 partitions 
 as follows:
 
 
 ​ 
 Connections use JDBC via beeline. Now for each partition using MR it takes 
 an average of 17 minutes as seen below for each PARTITION..  Now that is 
 just an individual partition and there are 48 partitions.
 
 In contrast doing the same operation with Spark engine took 10 

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Ovidiu-Cristian MARCU
Thank you, Amit! I was looking for this kind of information.

I did not fully read your paper; I see in it a TODO with basically the same 
question(s) [1]. Maybe someone from the Spark team (including Databricks) will be 
so kind as to send some feedback.

Best,
Ovidiu

[1] Integrate “Structured Streaming”: //TODO - What (and how) will Spark 2.0 
support (out-of-order, event-time windows, watermarks, triggers, accumulation 
modes) - how straight forward will it be to integrate with the Beam Model ?


> On 21 May 2016, at 23:00, Sela, Amit <ans...@paypal.com> wrote:
> 
> It seems I forgot to add the link to the “Technical Vision” paper so there it 
> is - 
> https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing
> 
> From: "Sela, Amit" <ans...@paypal.com <mailto:ans...@paypal.com>>
> Date: Saturday, May 21, 2016 at 11:52 PM
> To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr 
> <mailto:ovidiu-cristian.ma...@inria.fr>>, "user @spark" 
> <user@spark.apache.org <mailto:user@spark.apache.org>>
> Cc: Ovidiu Cristian Marcu <ovidiu21ma...@gmail.com 
> <mailto:ovidiu21ma...@gmail.com>>
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
> 
> This is a “Technical Vision” paper for the Spark runner, which provides 
> general guidelines to the future development of Spark’s Beam support as part 
> of the Apache Beam (incubating) project.
> This is our JIRA - 
> https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>  
> <https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel>
> 
> Generally, I’m currently working on Datasets integration for Batch (to 
> replace RDD) against Spark 1.6, and going towards enhancing Stream processing 
> capabilities with Structured Streaming (2.0)
> 
> And you’re welcomed to ask those questions at the Apache Beam (incubating) 
> mailing list as well ;)
> http://beam.incubator.apache.org/mailing_lists/ 
> <http://beam.incubator.apache.org/mailing_lists/>
> 
> Thanks,
> Amit
> 
> From: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr 
> <mailto:ovidiu-cristian.ma...@inria.fr>>
> Date: Tuesday, May 17, 2016 at 12:11 AM
> To: "user @spark" <user@spark.apache.org <mailto:user@spark.apache.org>>
> Cc: Ovidiu Cristian Marcu <ovidiu21ma...@gmail.com 
> <mailto:ovidiu21ma...@gmail.com>>
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
> 
> Could you please consider a short answer regarding the Apache Beam Capability 
> Matrix todo’s for future Spark 2.0 release [4]? (some related references 
> below [5][6])
> 
> Thanks
> 
> [4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what 
> <http://beam.incubator.apache.org/capability-matrix/#cap-full-what>
> [5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 
> <https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101>
> [6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 
> <https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102>
> 
>> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU 
>> <ovidiu-cristian.ma...@inria.fr <mailto:ovidiu-cristian.ma...@inria.fr>> 
>> wrote:
>> 
>> Hi,
>> 
>> We can see in [2] many interesting (and expected!) improvements (promises) 
>> like extended SQL support, unified API (DataFrames, DataSets), improved 
>> engine (Tungsten relates to ideas from modern compilers and MPP databases - 
>> similar to Flink [3]), structured streaming, etc. It seems we are somehow witnessing 
>> a smart unification of Big Data analytics (Spark, Flink - the best of two 
>> worlds)!
>> 
>> How does Spark respond to the missing What/Where/When/How questions 
>> (capabilities) highlighted in the unified model Beam [1] ?
>> 
>> Best,
>> Ovidiu
>> 
>> [1] 
>> https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
>>  
>> <https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective>
>> [2] 
>> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>>  
>> <https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html>
>> [3] http://stratosphere.eu/project/publications/ 
>> <http://stratosphere.eu/project/publications/>
>> 
>> 
> 



Re: Spark.default.parallelism can not set reduce number

2016-05-20 Thread Ovidiu-Cristian MARCU
You can check org.apache.spark.sql.internal.SQLConf for other default settings 
as well.
  val SHUFFLE_PARTITIONS = SQLConfigBuilder("spark.sql.shuffle.partitions")
.doc("The default number of partitions to use when shuffling data for joins 
or aggregations.")
.intConf
.createWithDefault(200)
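
So, a minimal sketch of the corresponding fix (assuming a Spark 1.x SQLContext in 
scope; the value 20 matches the intent below):

// spark.default.parallelism only governs RDD operations; SQL/DataFrame shuffles
// (joins, aggregations) use spark.sql.shuffle.partitions instead.
sqlContext.setConf("spark.sql.shuffle.partitions", "20")

// or per-session in SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=20")

// or globally in spark-defaults.conf:
//   spark.sql.shuffle.partitions  20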


> On 20 May 2016, at 13:17, 喜之郎 <251922...@qq.com> wrote:
> 
>  Hi all.
> I set spark.default.parallelism to 20 in spark-defaults.conf and sent 
> this file to all nodes.
> But I found the reduce number is still the default value, 200.
> Does anyone else encounter this problem? Can anyone give some advice?
> 
> 
> [Stage 9:>(0 + 0) / 
> 200]
> [Stage 9:>(0 + 2) / 
> 200]
> [Stage 9:>(1 + 2) / 
> 200]
> [Stage 9:>(2 + 2) / 
> 200]
> ###
> 
> And this results in many empty files. Because my data is small, only some of 
> the 200 files have data.
> ###
>  2016-05-20 17:01 
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-0
>  2016-05-20 17:01 
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-1
>  2016-05-20 17:01 
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-2
>  2016-05-20 17:01 
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-3
>  2016-05-20 17:01 
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-4
>  2016-05-20 17:01 
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-5
> 
> 
> 
> 



Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Could you please consider a short answer regarding the Apache Beam Capability 
Matrix todo’s for future Spark 2.0 release [4]? (some related references below 
[5][6])

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what 
<http://beam.incubator.apache.org/capability-matrix/#cap-full-what>
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 
<https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101>
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 
<https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102>

> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU 
> <ovidiu-cristian.ma...@inria.fr> wrote:
> 
> Hi,
> 
> We can see in [2] many interesting (and expected!) improvements (promises) 
> like extended SQL support, unified API (DataFrames, DataSets), improved 
> engine (Tungsten relates to ideas from modern compilers and MPP databases - 
> similar to Flink [3]), structured streaming, etc. It seems we are somehow witnessing 
> a smart unification of Big Data analytics (Spark, Flink - the best of two 
> worlds)!
> 
> How does Spark respond to the missing What/Where/When/How questions 
> (capabilities) highlighted in the unified model Beam [1] ?
> 
> Best,
> Ovidiu
> 
> [1] 
> https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
>  
> <https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective>
> [2] 
> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>  
> <https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html>
> [3] http://stratosphere.eu/project/publications/ 
> <http://stratosphere.eu/project/publications/>
> 
> 



What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Hi,

We can see in [2] many interesting (and expected!) improvements (promises) like 
extended SQL support, unified API (DataFrames, DataSets), improved engine 
(Tungsten relates to ideas from modern compilers and MPP databases - similar to 
Flink [3]), structured streaming, etc. It seems we are somehow witnessing a smart 
unification of Big Data analytics (Spark, Flink - the best of two worlds)!

How does Spark respond to the missing What/Where/When/How questions 
(capabilities) highlighted in the unified model Beam [1] ?

Best,
Ovidiu

[1] 
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
 

[2] 
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
 

[3] http://stratosphere.eu/project/publications/ 





Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
The Streaming use case is important IMO, as Spark (like Flink) advocates for 
the unification of analytics tools, so having everything in one: batch and graph 
processing, SQL, ML and streaming.

> On 17 Apr 2016, at 17:07, Corey Nolet <cjno...@gmail.com> wrote:
> 
> One thing I've noticed about Flink in my following of the project has been 
> that it has established, in a few cases, some novel ideas and improvements 
> over Spark. The problem with it, however, is that both the development team 
> and the community around it are very small and many of those novel 
> improvements have been rolled directly into Spark in subsequent versions. I 
> was considering changing over my architecture to Flink at one point to get 
> better, more real-time CEP streaming support, but in the end I decided to 
> stick with Spark and just watch Flink continue to pressure it into 
> improvement.
> 
> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers <ko...@tresata.com 
> <mailto:ko...@tresata.com>> wrote:
> i never found much info that flink was actually designed to be fault 
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that 
> doesn't bode well for large scale data processing. spark was designed with 
> fault tolerance in mind from the beginning.
> 
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> Hi,
> 
> I read the benchmark published by Yahoo. Obviously they already use Storm and 
> are inevitably very familiar with that tool. To start with, although these 
> benchmarks were somehow interesting IMO, it lent itself to an assurance that 
> the tool chosen for their platform is still the best choice. So inevitably 
> the benchmarks and the tests were done primarily to support their approach.
> 
> In general anything which is not done through the TPC Council or a similar body is 
> questionable..
> Their argument is that because Spark handles data streaming in micro batches 
> then inevitably it introduces this in-built latency as per design. In 
> contrast, both Storm and Flink do not (at the face value) have this issue.
> 
> In addition as we already know Spark has far more capabilities compared to 
> Flink (know nothing about Storm). So really it boils down to the business SLA 
> to choose which tool one wants to deploy for your use case. IMO Spark micro 
> batching approach is probably OK for 99% of use cases. If we had in built 
> libraries for CEP for Spark (I am searching for it), I would not bother with 
> Flink.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
> <ovidiu-cristian.ma...@inria.fr <mailto:ovidiu-cristian.ma...@inria.fr>> 
> wrote:
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> <https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at>
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella <andy.petre...@gmail.com 
>> <mailto:andy.petre...@gmail.com>> wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation is also rather sparse, with the usual example of word 
>> count, which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEAAA

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
For the streaming case Flink is fault tolerant (DataStream API), for the batch 
case (DataSet API) not yet, as from my research regarding their platform.

> On 17 Apr 2016, at 17:03, Koert Kuipers <ko...@tresata.com> wrote:
> 
> i never found much info that flink was actually designed to be fault 
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that 
> doesn't bode well for large scale data processing. spark was designed with 
> fault tolerance in mind from the beginning.
> 
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> Hi,
> 
> I read the benchmark published by Yahoo. Obviously they already use Storm and 
> inevitably very familiar with that tool. To start with although these 
> benchmarks were somehow interesting IMO, it lend itself to an assurance that 
> the tool chosen for their platform is still the best choice. So inevitably 
> the benchmarks and the tests were done to support primary their approach.
> 
> In general anything which is not done through TCP Council or similar body is 
> questionable..
> Their argument is that because Spark handles data streaming in micro batches 
> then inevitably it introduces this in-built latency as per design. In 
> contrast, both Storm and Flink do not (at the face value) have this issue.
> 
> In addition as we already know Spark has far more capabilities compared to 
> Flink (know nothing about Storm). So really it boils down to the business SLA 
> to choose which tool one wants to deploy for your use case. IMO Spark micro 
> batching approach is probably OK for 99% of use cases. If we had in built 
> libraries for CEP for Spark (I am searching for it), I would not bother with 
> Flink.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
> <ovidiu-cristian.ma...@inria.fr <mailto:ovidiu-cristian.ma...@inria.fr>> 
> wrote:
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> <https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at>
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella <andy.petre...@gmail.com 
>> <mailto:andy.petre...@gmail.com>> wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss <ascot.m...@gmail.com 
>> <mailto:ascot.m...@gmail.com>> wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> Thanks Ted. I was wondering if someone is using both :)
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?i

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
Hi Mich,

IMO one will try to see if there is an alternative, a better one at least.
This benchmark could be a good starting point.

Best,
Ovidiu
> On 17 Apr 2016, at 15:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> I read the benchmark published by Yahoo. Obviously they already use Storm and 
> inevitably very familiar with that tool. To start with although these 
> benchmarks were somehow interesting IMO, it lend itself to an assurance that 
> the tool chosen for their platform is still the best choice. So inevitably 
> the benchmarks and the tests were done to support primary their approach.
> 
> In general anything which is not done through TCP Council or similar body is 
> questionable..
> Their argument is that because Spark handles data streaming in micro batches 
> then inevitably it introduces this in-built latency as per design. In 
> contrast, both Storm and Flink do not (at the face value) have this issue.
> 
> In addition as we already know Spark has far more capabilities compared to 
> Flink (know nothing about Storm). So really it boils down to the business SLA 
> to choose which tool one wants to deploy for your use case. IMO Spark micro 
> batching approach is probably OK for 99% of use cases. If we had in built 
> libraries for CEP for Spark (I am searching for it), I would not bother with 
> Flink.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
> <ovidiu-cristian.ma...@inria.fr <mailto:ovidiu-cristian.ma...@inria.fr>> 
> wrote:
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> <https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at>
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella <andy.petre...@gmail.com 
>> <mailto:andy.petre...@gmail.com>> wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss <ascot.m...@gmail.com 
>> <mailto:ascot.m...@gmail.com>> wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> Thanks Ted. I was wondering if someone is using both :)
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 16 April 2016 at 17:08, Ted Yu <yuzhih...@gmail.com 
>> <mailto:yuzhih...@gmail.com>> wrote:
>> Looks like this question is more relevant on flink mailing list :-)
>> 
>> On Sat, Apr 16, 2016 at 8:52 

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
Yes, mostly regarding Spark partitioning and the use of groupByKey instead of 
reduceByKey.
However, Flink extended the benchmark here 
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ 
<http://data-artisans.com/extending-the-yahoo-streaming-benchmark/>
So I was curious about an answer from the Spark team: do they plan to do something 
similar?
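
(For reference, the groupByKey/reduceByKey point is roughly the following; the data 
is illustrative only, assuming an sc in scope.)

// Counting occurrences per key on an RDD.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1L))

// groupByKey shuffles every value across the network before summing:
val viaGroupByKey = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines locally on each partition first, so far less data is shuffled:
val viaReduceByKey = pairs.reduceByKey(_ + _)

viaReduceByKey.collect().foreach(println)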

> On 17 Apr 2016, at 15:33, Silvio Fiorito <silvio.fior...@granturing.com> 
> wrote:
> 
> Actually there were multiple responses to it on the GitHub project, including 
> a PR to improve the Spark code, but they weren’t acknowledged.
>  
>  
> From: Ovidiu-Cristian MARCU <mailto:ovidiu-cristian.ma...@inria.fr>
> Sent: Sunday, April 17, 2016 7:48 AM
> To: andy petrella <mailto:andy.petre...@gmail.com>
> Cc: Mich Talebzadeh <mailto:mich.talebza...@gmail.com>; Ascot Moss 
> <mailto:ascot.m...@gmail.com>; Ted Yu <mailto:yuzhih...@gmail.com>; user 
> @spark <mailto:user@spark.apache.org>
> Subject: Re: Apache Flink
>  
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> <https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at>
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella <andy.petre...@gmail.com 
>> <mailto:andy.petre...@gmail.com>> wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss <ascot.m...@gmail.com 
>> <mailto:ascot.m...@gmail.com>> wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> Thanks Ted. I was wondering if someone is using both :)
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 16 April 2016 at 17:08, Ted Yu <yuzhih...@gmail.com 
>> <mailto:yuzhih...@gmail.com>> wrote:
>> Looks like this question is more relevant on flink mailing list :-)
>> 
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> Hi,
>> 
>> Has anyone used Apache Flink instead of Spark by any chance
>> 
>> I am interested in its set of libraries for Complex Event Processing.
>> 
>> Frankly I don't know if it offers far more than Spark offers.
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> 
>> 
>> 
>> -- 
>> andy



Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
You probably read this benchmark at Yahoo, any comments from Spark?
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
 



> On 17 Apr 2016, at 12:41, andy petrella  wrote:
> 
> Just adding one thing to the mix: `that the latency for streaming data is 
> eliminated` is insane :-D
> 
> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh  > wrote:
>  It seems that Flink argues that the latency for streaming data is eliminated 
> whereas with Spark RDD there is this latency.
> 
> I noticed that Flink does not support interactive shell much like Spark shell 
> where you can add jars to it to do kafka testing. The advice was to add the 
> streaming Kafka jar file to CLASSPATH but that does not work.
> 
> Most Flink documentation also rather sparce with the usual example of word 
> count which is not exactly what you want.
> 
> Anyway I will have a look at it further. I have a Spark Scala streaming Kafka 
> program that works fine in Spark and I want to recode it using Scala for 
> Flink with Kafka but have difficulty importing and testing libraries.
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 17 April 2016 at 02:41, Ascot Moss  > wrote:
> I compared both last month, seems to me that Flink's MLLib is not yet ready.
> 
> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh  > wrote:
> Thanks Ted. I was wondering if someone is using both :)
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 16 April 2016 at 17:08, Ted Yu  > wrote:
> Looks like this question is more relevant on flink mailing list :-)
> 
> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh  > wrote:
> Hi,
> 
> Has anyone used Apache Flink instead of Spark by any chance
> 
> I am interested in its set of libraries for Complex Event Processing.
> 
> Frankly I don't know if it offers far more than Spark offers.
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> 
> 
> 
> -- 
> andy



Re: Graphx

2016-03-11 Thread Ovidiu-Cristian MARCU
Hi,

I wonder what version of Spark and what parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
John: I suppose your C++ app (algorithm) does not scale out, since you used only one 
node.
I don’t understand how the RDD serialization is taking excessive time - compared 
to the total time, or to some other expected time? 

For the different RDD times you have the event logs and the UI console, and a bunch of 
papers describing how to measure different things. lihu: did you use some 
incomplete tool, or what are you looking for?
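
(For context, the CC runs above used the stock GraphX call; roughly, assuming an 
edge-list file on HDFS:)

import org.apache.spark.graphx.GraphLoader

// One "src dst" pair per line; the partition count is illustrative.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
  numEdgePartitions = 480)

val cc = graph.connectedComponents()

// Number of connected components = number of distinct component ids.
println(cc.vertices.map(_._2).distinct().count())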

Best,
Ovidiu

> On 11 Mar 2016, at 16:02, John Lilley  wrote:
> 
> A colleague did the experiments and I don’t know exactly how he observed 
> that.  I think it was indirect: from the Spark diagnostics indicating the 
> amount of I/O he deduced that this was RDD serialization.  Also, when he added 
> light compression to RDD serialization this improved matters.
>  
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
> Skype: jlilley.redpoint | john.lil...@redpoint.net 
>  | www.redpoint.net 
> 
>  
> From: lihu [mailto:lihu...@gmail.com] 
> Sent: Friday, March 11, 2016 7:58 AM
> To: John Lilley 
> Cc: Andrew A ; u...@spark.incubator.apache.org
> Subject: Re: Graphx
>  
> Hi, John:
> I am very interested in your experiment. How did you determine that RDD 
> serialization costs a lot of time - from the logs or some other tools?
>  
> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley  > wrote:
> Andrew,
>  
> We conducted some tests for using Graphx to solve the connected-components 
> problem and were disappointed.  On 8 nodes of 16GB each, we could not get 
> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  
> RDD serialization would take excessive time and then we would get failures.  
> By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk 
> on a single 16GB node in about an hour.  I think that a very large cluster 
> will do better, but we did not explore that.
>  
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516   | M: +1 720 938 5761 
>  | F: +1 781-705-2077 
> Skype: jlilley.redpoint | john.lil...@redpoint.net 
>  | www.redpoint.net 
> 
>  
> From: Andrew A [mailto:andrew.a...@gmail.com ] 
> Sent: Thursday, March 10, 2016 2:44 PM
> To: u...@spark.incubator.apache.org 
> Subject: Graphx
>  
> Hi, is there anyone who uses GraphX in production? What is the maximum size of graph 
> you have processed with Spark, and what cluster do you use for it?
> 
> I tried to calculate PageRank for a 1 GB edges LJ dataset, for 
> LiveJournalPageRank from the Spark examples, and I faced large-volume 
> shuffles produced by Spark which fail my Spark job.
> 
> Thank you,
> Andrew



Re: off-heap certain operations

2016-02-16 Thread Ovidiu-Cristian MARCU
Well, the off-heap setting is quite important, and now I am curious about 
other parameters; I hope everything else is well documented and not misleading.

Best,
Ovidiu
> On 12 Feb 2016, at 19:18, Sean Owen <so...@cloudera.com> wrote:
> 
> I don't think much more is said since in fact it would affect parts of
> the implementations of lots of operations -- anything touching
> Tungsten. It wouldn't be meaningful to try to list everything.
> 
> The difference is allocating memory on the heap or with
> sun.misc.Unsafe. This is definitely something you'd need to be a
> developer to know whether to use, and if you're a developer and
> curious, you can just grep the code for this flag, and/or read into
> what Tungsten does.
> 
> Personally, I would leave this off.
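> 
(If one does want to experiment with it, a minimal sketch of the relevant settings; 
as far as I know spark.memory.offHeap.size must also be set, and the 2g value is 
purely illustrative:)

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")   // off-heap pool used by Tungsten

// equivalently, in spark-defaults.conf:
//   spark.memory.offHeap.enabled  true
//   spark.memory.offHeap.size     2g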
> 
> On Fri, Feb 12, 2016 at 6:10 PM, Ovidiu-Cristian MARCU
> <ovidiu-cristian.ma...@inria.fr> wrote:
>> I found nothing about the “certain operations”. Still not clear - “certain” is
>> poor documentation. Can someone give an answer so I can consider using this
>> new release?
>> spark.memory.offHeap.enabled
>> 
>> If true, Spark will attempt to use off-heap memory for certain operations.
>> 
>> On 12 Feb 2016, at 13:21, Ted Yu <yuzhih...@gmail.com> wrote:
>> 
>> SP
>> 
>> 





Lost executors failed job unable to execute spark examples Triangle Count (Analytics triangles)

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi,

I am able to run the Triangle Count example with some smaller graphs but when I 
am using http://snap.stanford.edu/data/com-Friendster.html 

I am not able to get the job to finish OK. For some reason Spark loses its 
executors.
No matter how I configure Spark (1.5) I just receive errors; the last 
configuration I used ran for some time, then it gave executors-lost 
errors.
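
For reference, roughly what the run looks like (the file name, partition strategy 
and timeout value below are illustrative, not a verified fix):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// TriangleCount needs canonical edges (srcId < dstId) and a partitioned graph.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/com-friendster.ungraph.txt",
  canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triangles = graph.triangleCount()
// Each triangle is counted once per participating vertex, so the sum is 3x the total.
println(triangles.vertices.map(_._2.toLong).sum())

// For lost executors / heartbeat timeouts, raising spark.network.timeout
// (e.g. --conf spark.network.timeout=600s) is a common, though not guaranteed, mitigation.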

Some exceptions/errors I got:

ERROR LiveListenerBus: Listener JobProgressListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:362)
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at 
org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:361)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

ERROR MapOutputTracker: Missing an output location for shuffle 1

ERROR TaskSchedulerImpl: Lost executor 29 on 172.16.96.49: worker lost
ERROR TaskSchedulerImpl: Lost executor 24 on 172.16.96.39: worker lost

16/02/16 12:41:47 WARN HeartbeatReceiver: Removing executor 8 with no recent 
heartbeats: 168312 ms exceeds timeout 12 ms
16/02/16 12:41:47 ERROR TaskSchedulerImpl: Lost executor 8 on 172.16.96.53: 
Executor heartbeat timed out after 168312 ms

16/02/16 12:41:47 ERROR TaskSchedulerImpl: Lost executor 9 on 172.16.96.9: 
Executor heartbeat timed out after 163671 ms

16/02/16 12:53:53 ERROR TaskSetManager: Task 9 in stage 6.2 failed 4 times; 
aborting job

16/02/16 12:54:42 ERROR ContextCleaner: Error cleaning broadcast 19
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
seconds]. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
at 
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 

spark examples Analytics ConnectedComponents - keep running, nothing in output

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi

I’m trying to run Analytics cc (ConnectedComponents) but it keeps running without 
ending.
The logs look fine, but I just keep getting “Job xyz finished: reduce ..., took ...” messages:

...
INFO DAGScheduler: Job 29 finished: reduce at VertexRDDImpl.scala:90, took 
14.828033 s
INFO DAGScheduler: Job 30 finished: reduce at VertexRDDImpl.scala:90, took 
15.341294 s
..

..
INFO TaskSetManager: Finished task 299.0 in stage 53059.0 (TID 88025) in 81 ms 
on 172.16.99.22 (195/480)
INFO TaskSetManager: Starting task 47.0 in stage 53059.0 (TID 88075, 
172.16.99.31, PROCESS_LOCAL, 5367 bytes)
..

I am using Spark 1.5 standalone and input graph 
http://snap.stanford.edu/data/web-BerkStan.html 


It seems there is no convergence, can you help me understand what is wrong in 
your example?

Thanks

Best,
Ovidiu

Re: off-heap certain operations

2016-02-12 Thread Ovidiu-Cristian MARCU
I found nothing about the “certain operations”. Still not clear - “certain” is poor 
documentation. Can someone give an answer so I can consider using this new 
release?
spark.memory.offHeap.enabled

If true, Spark will attempt to use off-heap memory for certain operations.

> On 12 Feb 2016, at 13:21, Ted Yu  wrote:
> 
> SP



off-heap certain operations

2016-02-11 Thread Ovidiu-Cristian MARCU
Hi,

Reading through the latest documentation for memory management I can see that 
the parameter spark.memory.offHeap.enabled (true by default) is described with 
‘If true, Spark will attempt to use off-heap memory for certain operations’ [1].

Can you please describe the certain operations you are referring to?  

[1] http://spark.apache.org/docs/latest/configuration.html#memory-management


Thanks!

Best,
Ovidiu