files and
directories
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:45 PM
To: Evo Eftimov
Cc:
Subject: Re: saveAsTextFile
Thanks Evo for your detailed explanation.
On Apr 16, 2015, at 1:38 PM, Evo Eftimov wrote:
The reason for this is
Basically you need to unbundle the elements of the RDD and then store them
wherever you want - use foreachPartition and then foreach
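A minimal Scala sketch of that approach, assuming a socket text stream as the source and a hypothetical writeRecord sink (replace it with your own HDFS / database / service writer):

import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnbundleAndStore {
  // Hypothetical sink - stands in for whatever store you write to
  def writeRecord(partitionId: Int, record: String): Unit =
    println(s"partition $partitionId -> $record")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UnbundleAndStore")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Unbundle each micro-batch RDD: iterate the partitions, then the
    // elements, and store each element yourself instead of calling saveAsTextFile
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val pid = TaskContext.get.partitionId()
        partition.foreach(record => writeRecord(pid, record))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}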
-Original Message-
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:39 PM
To: Sean Owen
Cc: user@spark.apache.org
Nope Sir, it is possible - check my reply earlier
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, April 16, 2015 6:35 PM
To: Vadim Bichutskiy
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile
You can't, since that's how it's designed to work. Batches ar
HDFS adapter and invoke it in foreachRDD and foreach
Regards
Evo Eftimov
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile
I am using Spark Streaming where during each micro-batch I
-on-yarn.html
From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:21 PM
To: Evo Eftimov; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance
Thanks Evo. Yes, my concern is only regarding the infrastructure
Michael, what exactly do you mean by "flattened" version/structure here, e.g.:
1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects
This question is in ge
because all worker instances run in the memory of a single
machine ..
Regards,
Evo Eftimov
From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance
Hi
Ningjun, to speed up your current design you can do the following (a sketch of 1 and 2 follows the list):
1. partition the large doc RDD based on a hash function on the key, i.e. the docId
2. persist the large dataset in memory so it is available for subsequent queries
without reloading and repartitioning for every search query
3. parti
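A spark-shell-style Scala sketch of points 1 and 2; the tab-separated docId/text layout, the HDFS path and the partition count of 64 are illustrative assumptions:

import org.apache.spark.SparkContext._
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Assumed input: one "docId<TAB>text" record per line
val docs = sc.textFile("hdfs:///data/docs").map { line =>
  val Array(docId, text) = line.split("\t", 2)
  (docId, text)
}

// 1. hash-partition on the key (docId) so all records for a doc land in one partition
// 2. persist so subsequent search queries reuse the cached, already-partitioned data
val partitionedDocs = docs
  .partitionBy(new HashPartitioner(64))
  .persist(StorageLevel.MEMORY_ONLY)

partitionedDocs.count() // materialize the cache once up front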
Sent: Thursday, April 16, 2015 4:32 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Data partitioning and node tracking in Spark-GraphX
Thanks a lot for the reply. Indeed it is useful, but to be more precise, I have
3D data and want to index it using an octree. Thus I aim to build a two-level
indexing
/framework your app
code should not be bothered about which physical node exactly a partition resides on
Regards
Evo Eftimov
From: MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com]
Sent: Thursday, April 16, 2015 4:20 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Data partitioning and node tracking in Spark-GraphX
How do you intend to "fetch the required data" - from within Spark or using
an app / code / module outside Spark?
-Original Message-
From: mas [mailto:mas.ha...@gmail.com]
Sent: Thursday, April 16, 2015 4:08 PM
To: user@spark.apache.org
Subject: Data partitioning and node tracking in Spark-GraphX
which cannot be done at the same time and has to be
processed sequentially is a BAD thing.
So the key is whether it is about 1 or 2, and if it is about 1, whether it leads
to e.g. higher throughput and lower latency or not
Regards,
Evo Eftimov
From: Gerard Maas [mailto:gerard.m
And yet another way is to demultiplex at one point, which will yield separate
DStreams for each message type, which you can then process in independent DAG
pipelines in the following way:
MessageType1DStream = MainDStream.filter(message type1)
MessageType2DStream = MainDStream.filter(message type2)
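A Scala sketch of that filter-based demultiplexing, assuming (purely for illustration) that each message arrives as a (messageType, payload) pair:

import org.apache.spark.streaming.dstream.DStream

def demultiplex(mainDStream: DStream[(String, String)]): Unit = {
  // One filtered DStream per message type
  val messageType1DStream = mainDStream.filter { case (msgType, _) => msgType == "type1" }
  val messageType2DStream = mainDStream.filter { case (msgType, _) => msgType == "type2" }

  // Each filtered DStream now heads its own independent processing pipeline
  messageType1DStream.map(_._2).print()
  messageType2DStream.map(_._2).count().print()
}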
Also you can have each message type in a different topic (needs to be arranged
upstream from your Spark Streaming app, i.e. in the publishing systems and the
messaging brokers) and then for each topic you can have a dedicated instance of
InputReceiverDStream which will be the start of a dedicated DAG pipeline
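A sketch of the per-topic variant using the receiver-based Kafka API (spark-streaming-kafka); the ZooKeeper quorum, consumer group and topic names are placeholders:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def perTopicStreams(ssc: StreamingContext): Unit = {
  val zkQuorum = "localhost:2181"

  // One dedicated receiver/input DStream per topic
  val type1Stream = KafkaUtils.createStream(ssc, zkQuorum, "app-group", Map("topic-type1" -> 1))
  val type2Stream = KafkaUtils.createStream(ssc, zkQuorum, "app-group", Map("topic-type2" -> 1))

  // Each input DStream is then the start of its own processing pipeline
  type1Stream.map(_._2).print()
  type2Stream.map(_._2).print()
}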
that DStreams are some sort of different type of RDDs
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 11:11 PM
To: Evo Eftimov
Cc: user
Subject: Re: RAM management during cogroup and join
Well, DStream joins are nothing but RDD joins at their core. However
Thank you Sir, and one final confirmation/clarification - are all forms of
joins in the Spark API for DStream RDDs based on cogroup in terms of their
internal implementation?
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 9:48 PM
To: Evo Eftimov
Cc: user
change the total number of elements
included in the result RDD and RAM allocated – right?
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 9:25 PM
To: Evo Eftimov
Cc: user
Subject: Re: RAM management during cogroup and join
Significant optimizations can be made
There are indications that joins in Spark are implemented with / based on the
cogroup function/primitive/transform. So let me focus first on cogroup - it
returns a result which is an RDD consisting of essentially ALL elements of the
cogrouped RDDs. Said another way - for every key in each of the co
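A tiny spark-shell-style illustration of those cogroup semantics (the output is shown conceptually in the comments):

import org.apache.spark.SparkContext._

val left  = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val right = sc.parallelize(Seq(("a", 10), ("c", 30)))

// For every key appearing in either RDD, the result holds ALL of that
// key's values from each side:
//   ("a", ([1, 2], [10]))
//   ("b", ([3],    []  ))
//   ("c", ([],     [30]))
left.cogroup(right).collect().foreach(println)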
e not getting anywhere
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 8:30 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch RDD from DStream RDD
What API differences are you talking about? A DStream gives
h RDDs from file for e.g. a second
time, moreover after a specific period of time
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 8:14 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch RDD from DStream RDD
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 7:43 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch RDD from DStream RDD
What do you mean by "batch RDD"? They're just RDDs, though store their d
The only way to join / union / cogroup a DStream RDD with a Batch RDD is via the
"transform" method, which returns another DStream RDD and hence it gets
discarded at the end of the micro-batch.
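A minimal Scala sketch of that transform-based approach; the HDFS path for the batch RDD is a placeholder:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// The batch RDD is combined with each micro-batch RDD via transform,
// but the result is another DStream, so it only lives for that micro-batch.
def unionWithBatch(ssc: StreamingContext, stream: DStream[String]): DStream[String] = {
  val batchRDD: RDD[String] = ssc.sparkContext.textFile("hdfs:///data/batch")
  stream.transform(streamRDD => streamRDD.union(batchRDD))
}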
Is there any way to e.g. union a DStream RDD with a Batch RDD which produces a
new Batch RDD containing the el