RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
files and directories From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:45 PM To: Evo Eftimov Cc: Subject: Re: saveAsTextFile Thanks Evo for your detailed explanation. On Apr 16, 2015, at 1:38 PM, Evo Eftimov wrote: The reason for this is

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them wherever you want - Use foreachPartition and then foreach -Original Message- From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:39 PM To: Sean Owen Cc: user@spark.apache.or
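A minimal sketch (Scala, Spark 1.x Streaming API) of the unbundling approach described above: iterate over each micro-batch RDD with foreachRDD, then over each partition with foreachPartition, writing the elements to whatever sink you choose. The socket source and the println sink are placeholders, not part of the original thread.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UnbundleAndStore {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("unbundle"), Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

        lines.foreachRDD { rdd =>
          rdd.foreachPartition { iter =>
            // open one writer/connection per partition here, not per element
            iter.foreach(record => println(record)) // replace with your real sink
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }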

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed to work. Batches ar

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
HDFS adapter and invoke it in foreachRDD and foreach Regards Evo Eftimov From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:33 PM To: user@spark.apache.org Subject: saveAsTextFile I am using Spark Streaming where during each micro-batch I

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
-on-yarn.html From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Thursday, April 16, 2015 6:21 PM To: Evo Eftimov; user@spark.apache.org Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance Thanks Evo. Yes, my concern is only regarding the infrastructure

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael what exactly do you mean by "flattened" version/structure here e.g.: 1. An Object with only primitive data types as attributes 2. An Object with no more than one level of other Objects as attributes 3. An Array/List of primitive types 4. An Array/List of Objects This question is in ge
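For concreteness, hypothetical Scala shapes matching the four options enumerated above (the names are made up purely for illustration):

    case class Flat(id: Long, score: Double)         // 1. primitives only
    case class OneLevel(id: Long, inner: Flat)       // 2. one level of nested objects
    val primArray: Array[Double] = Array(1.0, 2.0)   // 3. array/list of primitives
    val objArray: Array[Flat] = Array(Flat(1L, 0.5)) // 4. array/list of objects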

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
because all worker instances run in the memory of a single machine .. Regards, Evo Eftimov From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Thursday, April 16, 2015 6:03 PM To: user@spark.apache.org Subject: General configurations on CDH5 to achieve maximum Spark Performance Hi

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Ningjun, to speed up your current design you can do the following: 1. partition the large doc RDD based on the hash function on the key i.e. the docid 2. persist the large dataset in memory to be available for subsequent queries without reloading and repartitioning for every search query 3. parti
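A sketch of those steps in Scala, assuming lines of "docId<TAB>text" on HDFS and an existing SparkContext sc; the path, partition count and storage level are illustrative choices, not prescriptions from the thread:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    val partitioner = new HashPartitioner(64) // partition count is a tuning choice

    // 1. + 2. hash-partition the large doc RDD on docid and keep it in memory
    val docs = sc.textFile("hdfs:///docs.tsv") // assumed input location/format
      .map { line => val Array(id, text) = line.split("\t", 2); (id, text) }
      .partitionBy(partitioner)
      .persist(StorageLevel.MEMORY_ONLY)

    // 3. co-partition the (smaller) query-side RDD with the same partitioner
    // so the join never reshuffles the large side
    val queryIds = sc.parallelize(Seq("doc-1" -> (), "doc-2" -> ()))
      .partitionBy(partitioner)

    val matched = docs.join(queryIds).mapValues { case (text, _) => text }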

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
, April 16, 2015 4:32 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Data partitioning and node tracking in Spark-GraphX Thanks a lot for the reply. Indeed it is useful but to be more precise i have 3D data and want to index it using octree. Thus i aim to build a two level indexing
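A speculative sketch of the coarse (first) level of such an index: a custom Spark Partitioner that routes each 3D point to one of the eight octants around a centre point, so spatially close points co-locate in the same partition. Spark then places the partitions itself; as the replies note, the application never needs to know which physical node holds which partition.

    import org.apache.spark.Partitioner

    case class Point3D(x: Double, y: Double, z: Double)

    class OctantPartitioner(center: Point3D) extends Partitioner {
      override def numPartitions: Int = 8
      override def getPartition(key: Any): Int = key match {
        case p: Point3D =>
          (if (p.x >= center.x) 4 else 0) +
          (if (p.y >= center.y) 2 else 0) +
          (if (p.z >= center.z) 1 else 0)
        case _ => 0 // non-point keys fall into octant 0
      }
    }

An RDD[(Point3D, V)] repartitioned with partitionBy(new OctantPartitioner(c)) would then carry the finer per-octant octree as the second level of the index.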

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
/framework your app code should not be bothered on which physical node exactly, a partition resides Regards Evo Eftimov From: MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:20 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Data partitioning and node

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
How do you intend to "fetch the required data" - from within Spark or using an app / code / module outside Spark -Original Message- From: mas [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:08 PM To: user@spark.apache.org Subject: Data partitioning and node tracking in Spa

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
which cannot be done at the same time and has to be processed sequentially is a BAD thing So the key is whether it is about 1 or 2 and if it is about 1, whether it leads to e.g. Higher Throughput and Lower Latency or not Regards, Evo Eftimov From: Gerard Maas [mailto:gerard.m

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
And yet another way is to demultiplex at one point which will yield separate DStreams for each message type which you can then process in independent DAG pipelines in the following way: MessageType1DStream = MainDStream.filter(message type1) MessageType2DStream = MainDStream.filter(message t
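A runnable rendering of that filter-based demultiplexing (Scala; the Msg envelope, the type tags and the downstream operations are assumed for illustration):

    import org.apache.spark.streaming.dstream.DStream

    case class Msg(msgType: String, payload: String) // assumed message envelope

    def demultiplex(main: DStream[Msg]): Unit = {
      val type1 = main.filter(_.msgType == "type1")
      val type2 = main.filter(_.msgType == "type2")

      // each filtered stream heads its own independent DAG pipeline
      type1.map(_.payload.toUpperCase).print()
      type2.map(_.payload.length).print()
    }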

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Also you can have each message type in a different topic (needs to be arranged upstream from your Spark Streaming app ie in the publishing systems and the messaging brokers) and then for each topic you can have a dedicated instance of InputReceiverDStream which will be the start of a dedicated D
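A hedged sketch of that topic-per-type layout with the Spark 1.x Kafka receiver API (spark-streaming-kafka module); the ZooKeeper address, group id and topic names are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("per-topic"), Seconds(5))

    // one dedicated receiver DStream per topic, each starting its own pipeline
    val type1Stream = KafkaUtils.createStream(ssc, "zkhost:2181", "app-group", Map("type1-topic" -> 1))
    val type2Stream = KafkaUtils.createStream(ssc, "zkhost:2181", "app-group", Map("type2-topic" -> 1))

    type1Stream.map(_._2).print() // the tuple's second element is the payload
    type2Stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()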

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
that DStreams are some sort of different type of RDDs From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 11:11 PM To: Evo Eftimov Cc: user Subject: Re: RAM management during cogroup and join Well, DStream joins are nothing but RDD joins at its core. However

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
Thank you Sir, and one final confirmation/clarification - are all forms of joins in the Spark API for DStream RDDs based on cogroup in terms of their internal implementation From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 9:48 PM To: Evo Eftimov Cc: user

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
change the total number of elements included in the result RDD and RAM allocated – right? From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 9:25 PM To: Evo Eftimov Cc: user Subject: Re: RAM management during cogroup and join Significant optimizations can be made

RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
There are indications that joins in Spark are implemented with / based on the cogroup function/primitive/transform. So let me focus first on cogroup - it returns a result which is an RDD consisting of essentially ALL elements of the cogrouped RDDs. Put another way - for every key in each of the co
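A tiny illustration of that point, assuming an existing SparkContext sc: every key from every input survives into the cogroup result, paired with the full Iterables of its values (empty where a side has no entries).

    val a = sc.parallelize(Seq(1 -> "a", 1 -> "b", 2 -> "c"))
    val b = sc.parallelize(Seq(2 -> "x", 3 -> "y"))

    // collect() returns, in some order, roughly:
    //   (1, (CompactBuffer(a, b), CompactBuffer()))
    //   (2, (CompactBuffer(c),    CompactBuffer(x)))
    //   (3, (CompactBuffer(),     CompactBuffer(y)))
    a.cogroup(b).collect().foreach(println)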

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
e not getting anywhere -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:30 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD What API differences are you talking about? a DStream gives

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
h RDDs from file for e.g. a second time moreover after specific period of time -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:14 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
iginal Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 7:43 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD What do you mean by "batch RDD"? they're just RDDs, though store their d

adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
The only way to join / union / cogroup a DStream RDD with a Batch RDD is via the "transform" method, which returns another DStream RDD and hence it gets discarded at the end of the micro-batch. Is there any way to e.g. union a DStream RDD with a Batch RDD which produces a new Batch RDD containing the el
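One workaround sometimes suggested for this pattern (a sketch under stated assumptions, not taken verbatim from this thread's replies): hold the batch RDD in a driver-side variable and grow it inside foreachRDD, checkpointing so the union lineage stays bounded. It assumes sc.setCheckpointDir has already been called.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def accumulate(initial: RDD[String], stream: DStream[String]): Unit = {
      var combined: RDD[String] = initial
      stream.foreachRDD { rdd =>
        combined = combined.union(rdd).cache()
        combined.checkpoint() // truncate the ever-growing union lineage
      }
    }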
