Hello All,
I had a question regarding the performance optimization (the Catalyst
Optimizer) of DataFrames. I understand that DataFrames are interoperable
with RDDs. If I switch back and forth between DataFrames and RDDs, does the
performance optimization still kick in? I need to switch to RDDs to
Thank you for the comments. As you mentioned, increasing the thread pool
allowed more jobs to run in parallel, and decreasing the number of partitions
allowed more RDDs to execute in parallel. Much appreciated.
On Aug 31, 2015 7:07 AM, "Igor Berman" wrote:
> what is size of the pool you submitti
What is the size of the pool you are submitting Spark jobs from (the futures
you mentioned)? Is it 8? I think you have a fixed thread pool of 8, so there
can't be more than 8 parallel jobs running... so try to increase it.
What is the number of partitions of each of your RDDs?
How many cores has your work
Hi, I have a large number of RDDs that I need to process separately.
Instead of submitting these jobs to the Spark scheduler one by one, I'd
like to submit them in parallel in order to maximize cluster utilization.
I've tried to process the RDDs as Futures, but the number of Active jobs
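For reference, a minimal sketch (not from the original thread) of submitting several RDD jobs concurrently from a sized thread pool, along the lines discussed above; rdds and the per-RDD action are placeholders:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// The pool size bounds how many jobs are submitted in parallel; increase it as suggested.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))

val jobs = rdds.map { rdd =>
  Future { rdd.count() }  // stand-in for the real per-RDD action
}
Await.result(Future.sequence(jobs), Duration.Inf)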
>> On Aug 21, 2015 12:06 AM, "Rishitesh Mishra"
>> wrote:
>> I am not sure if you can view all RDDs in a session. Tables are maintained
>> in a catalogue . Hence its easier. However you can see the DAG
>> representation , which lists all the RD
You can get the list of all the persisted RDDs using the Spark context...
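For example, a minimal sketch assuming sc is the active SparkContext (as in spark-shell or Zeppelin):

// getPersistentRDDs returns a Map of RDD id -> RDD for everything currently marked persisted.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: name=${rdd.name}, storage=${rdd.getStorageLevel.description}")
}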
On Aug 21, 2015 12:06 AM, "Rishitesh Mishra"
wrote:
> I am not sure if you can view all RDDs in a session. Tables are maintained
> in a catalogue . Hence its easier. However you can see the DAG
> representatio
I am not sure if you can view all RDDs in a session. Tables are maintained
in a catalogue, hence it's easier for them. However, you can see the DAG
representation, which lists all the RDDs in a job, in the Spark UI.
On 20 Aug 2015 22:34, "Dhaval Patel" wrote:
> Apologies
>
> I ac
Apologies
I accidentally included Spark User DL on BCC. The actual email message is
below.
=
Hi:
I have been working on a few examples using Zeppelin.
I have been trying to find a command that would list all *dataframes/RDDs*
that
Hi:
I have been working on a few examples using Zeppelin.
I have been trying to find a command that would list all *dataframes/RDDs*
that have been created in the current session. Does anyone know if any such
command is available?
Something similar to the SparkSQL command to list all temp tables:
show
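(For reference, a small sketch of the SparkSQL side, assuming a sqlContext is available in the session; this is not from the original message:)

// Registered temporary tables can be listed from the SQLContext.
sqlContext.tableNames().foreach(println)
// or as a DataFrame that includes an isTemporary column:
sqlContext.tables().show()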
, Matthew O'Reilly wrote:
> Hi,
>
> I am currently working on the latest version of Apache Spark (1.4.1),
> pre-built package for Hadoop 2.6+.
>
> Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache
> (something similar is Altibase's HDB:
> http
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA
to add this feature, and maybe it could be added in a future release.
Thanks
Best Regards
On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly
wrote:
> Hi,
>
> I am currently working on the latest version o
Hi,
I am currently working on the latest version of Apache Spark (1.4.1), pre-built
package for Hadoop 2.6+.
Is there any feature in Spark/Hadoop to encrypt RDDs or the in-memory cache
(something similar to Altibase's HDB:
http://altibase.com/in-memory-database-computing-solutions/sec
Hi, Alice
Did you find a solution?
I have exactly the same problem.
Hi Stahlman,
finalRDDStorageLevel is the storage level for the final user/item
factors. It is not common to set it to StorageLevel.NONE, unless you
want to save the factors directly to disk. So if it is NONE, we cannot
unpersist the intermediate RDDs (in/out blocks) because the final
user/item
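As a hedged sketch of the knobs being discussed (assuming the MLlib ALS builder API of that era; ratings is a hypothetical RDD[Rating]):

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(10)
  .setIterations(10)
  // intermediate in/out blocks; MEMORY_AND_DISK is the default
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  // final user/item factors; keep these persisted unless you write them straight to disk
  .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
val model = als.run(ratings)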
Fri, Jul 24, 2015 at 2:12 PM, IASIB1 wrote:
> I am currently working on the latest version of Apache Spark (1.4.1),
> pre-built package for Hadoop 2.6+.
>
> Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory
> (similarly
> to Altibase's HDB:
> http://al
I am currently working on the latest version of Apache Spark (1.4.1),
pre-built package for Hadoop 2.6+.
Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory (similarly
to Altibase's HDB:
http://altibase.com/in-memory-database-computing-solutions/security/
<http://altibas
Yes. Tachyon can handle this well: http://tachyon-project.org/
Best,
Haoyuan
On Wed, Jul 22, 2015 at 10:56 AM, swetha wrote:
> Hi,
>
> We have a requirement wherein we need to keep RDDs in memory between Spark
> batch processing that happens every one hour. The idea here is
abe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
>>> .
>>>
>>> -Andrew
>>>
>>> 2015-07-21 19:56 GMT-07:00 ayan guha :
>>>
>>>> Either you have to do rdd.collect and then broadcast or you can do a
>>>&
I was about to say what the previous post said, so +1 to the previous
post. From my understanding (gut feeling) of your requirement, it is very easy
to do this with spark-job-server.
e to do rdd.collect and then broadcast or you can do a join
>>> On 22 Jul 2015 07:54, "Dan Dong" wrote:
>>>
>>>> Hi, All,
>>>>
>>>>
>>>> I am trying to access a Map from RDDs that are on different compute
>>>
do rdd.collect and then broadcast or you can do a join
>> On 22 Jul 2015 07:54, "Dan Dong" wrote:
>>
>>> Hi, All,
>>>
>>>
>>> I am trying to access a Map from RDDs that are on different compute
>>> nodes, but without success. The
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but
it's not exactly the same as keeping it cached in Spark. Spark Job Server
is a way to keep it cached in Spark.
Tachyon is one way. Also check out the Spark Job Server
<https://github.com/spark-jobserver/spark-jobserver> .
talone.com<mailto:jonathan.stahl...@capitalone.com>]
Sent: Wednesday, July 22, 2015 01:42 PM Eastern Standard Time
To: user@spark.apache.org
Subject: Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel
Hello again,
In trying to understand the caching of intermediate RDDs by ALS, I looked into
Hi Burak,
Looking at the source code, the intermediate RDDs used in ALS.train() are
persisted during the computation using intermediateRDDStorageLevel (default
value is StorageLevel.MEMORY_AND_DISK) - see
here<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark
Hi,
We have a requirement wherein we need to keep RDDs in memory between Spark
batch jobs that run every hour. The idea here is to have RDDs that hold
active user sessions in memory between two jobs, so that once one job
finishes and another job runs an hour later, the RDDs
l 22, 2015 at 10:38 AM, Stahlman, Jonathan <
jonathan.stahl...@capitalone.com> wrote:
> Hello again,
>
> In trying to understand the caching of intermediate RDDs by ALS, I looked
> into the source code and found what may be a bug. Looking here:
>
>
> https://github.com/ap
Hello again,
In trying to understand the caching of intermediate RDDs by ALS, I looked into
the source code and found what may be a bug. Looking here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L230
you see that ALS.train
's the most memory
intensive.
-Andrew
2015-07-21 13:47 GMT-07:00 wdbaruni :
> I am new to Spark and I understand that Spark divides the executor memory
> into the following fractions:
>
> *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or
> .ca
rg/apache/spark/examples/BroadcastTest.scala
.
-Andrew
2015-07-21 19:56 GMT-07:00 ayan guha :
> Either you have to do rdd.collect and then broadcast or you can do a join
> On 22 Jul 2015 07:54, "Dan Dong" wrote:
>
>> Hi, All,
>>
>>
>> I am trying to acc
Either you have to do rdd.collect and then broadcast or you can do a join
On 22 Jul 2015 07:54, "Dan Dong" wrote:
> Hi, All,
>
>
> I am trying to access a Map from RDDs that are on different compute nodes,
> but without success. The Map is like:
>
> val map
Hi, All,
I am trying to access a Map from RDDs that are on different compute nodes,
but without success. The Map is like:
val map1 = Map("aa"->1,"bb"->2,"cc"->3,...)
All RDDs will have to check against it to see if the key is in the Map or
not, so see
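A minimal sketch of the collect-then-broadcast route suggested above, assuming sc and a hypothetical keysRdd of keys to check:

val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
val bcMap = sc.broadcast(map1)                       // shipped once to each executor
val matching = keysRdd.filter(k => bcMap.value.contains(k))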
I am new to Spark and I understand that Spark divides the executor memory
into the following fractions:
*RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or
.cache() and can be defined by setting spark.storage.memoryFraction (default
0.6)
*Shuffle and aggregation buffers
. Sample code in Python is
copied below.
The issue I have is that each newly trained model caches a set of RDDs,
and eventually the executors run out of memory. Is there any way in PySpark to
unpersist() these RDDs after each iteration? The names of the RDDs, which I
gather from the UI
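On the Scala side (a hedged sketch only; the PySpark equivalent is not shown here), the RDDs a context currently holds can be dropped between iterations:

// Unpersist everything still marked persisted before training the next model.
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))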
On Thu, Jul 16, 2015 at 7:37 AM Brandon White
wrote:
> Hello,
>
> I have a list of rdds
>
> List(rdd1, rdd2, rdd3,rdd4)
>
> I would like to save these rdds in parallel. Right now, it is running each
> operation sequentially. I tried using a rdd of rdd but that does no
sc.union(rdds).saveAsTextFile()
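Spelling that out as a small sketch, assuming rdds: Seq[RDD[String]] and an output path of your choosing:

// One job writes the union of all the RDDs instead of one sequential job per RDD.
val combined = sc.union(rdds)
combined.saveAsTextFile("/tmp/cache/combined")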
On Wed, Jul 15, 2015 at 10:37 PM, Brandon White wrote:
> Hello,
>
> I have a list of rdds
>
> List(rdd1, rdd2, rdd3,rdd4)
>
> I would like to save these rdds in parallel. Right now, it is running each
> operation sequentially. I tried usi
Hello,
I have a list of rdds
List(rdd1, rdd2, rdd3,rdd4)
I would like to save these rdds in parallel. Right now, it is running each
operation sequentially. I tried using an RDD of RDDs, but that does not work.
list.foreach { rdd =>
  rdd.saveAsTextFile("/tmp/cache/")
}
Any ideas?
in my Spark Streaming program
>>> (Java):
>>>
>>> dStream.foreachRDD((rdd, batchTime) -> {
>>> log.info("processing RDD from batch {}", batchTime);
>>>
>>> // my rdd processing code
>>>
>> Instead of having my rdd processing code called once for each RDD in the
>> batch, is it possible to essentially group all of the RDDs from the batch
>> into a single RDD and single partition and therefore operate on all of the
>> elements in the batch at once?
called once for each RDD in the
> batch, is it possible to essentially group all of the RDDs from the batch
> into a single RDD and single partition and therefore operate on all of the
> elements in the batch at once?
>
> My goal here is to do an operation exactly once for every bat
Instead of having my rdd processing code called once for each RDD in the
batch, is it possible to essentially group all of the RDDs from the batch
into a single RDD and single partition and therefore operate on all of the
elements in the batch at once?
My goal here is to do an operation exactly on
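A hedged sketch of one way to do this in Scala: a DStream already yields exactly one RDD per batch, so coalescing to a single partition lets one task see the whole batch (processBatch is hypothetical):

dStream.foreachRDD { (rdd, batchTime) =>
  rdd.coalesce(1).foreachPartition { elements =>
    processBatch(batchTime, elements)   // runs once with every element of the batch
  }
}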
ailable"?
>
> Also, what are the correct imports to get this working?
>
> I'm using sbt assembly to try to compile these files, and would really
> appreciate any help.
>
> Thanks,
> Ashley Wang
y Wang
>> From: Tathagata Das
>> Date: 20 June 2015 at 17:21
>> Subject: Re: Serial batching with Spark Streaming
>> To: Michal Čizmazia
>> Cc: Binh Nguyen Van , user
>>
>>
>> No it does not. By default, only after all the retries etc related to
>> ba
tarted.
>
> Yes, one RDD per batch per DStream. However, the RDD could be a union of
> multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned
> DStream).
>
> TD
>
> On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia
> wrote:
> Thanks Tathagata!
>
> I wil
: Re: Serial batching with Spark Streaming
To: Michal Čizmazia
Cc: Binh Nguyen Van , user
No it does not. By default, only after all the retries etc. related to batch
X are done will batch X+1 be started.
Yes, one RDD per batch per DStream. However, the RDD could be a union of
multiple RDDs
ck of
> messages, i.e. no need to ack one-by-one, but only ack the last event in a
> batch and that would ack the entire batch.
>
> Before I commit to doing so, I'd like to know if Spark Streaming always
> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
ng so, I'd like to know if Spark Streaming always
processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before
RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
finished?
This is crucial to the ack logic, since if RDD2 can be potentially processed
whi
SparkContext once, as soon as you have all RDDs ready. For python it
looks this way:
rdds = []
for i in xrange(cnt):
    rdd = ...
    rdds.append(rdd)
finalRDD = sparkContext.union(rdds)
HTH,
Tomasz
W dniu 18.06.2015 o 02:53, Matt Forbes pisze:
I have multiple input paths which
: rdd.union(nextRdd);
rdd = rdd.coalesce(nextRdd.partitions().size());
}
Now, for a small number of inputs there doesn't seem to be a problem, but
the full set, which is about 60 sub-RDDs coming in at around 500MM total
records, takes a very long time to construct. Just for a simple
load
Thanks very much for the detailed explanations. I suspected there were
architectural reasons why an RDD of RDDs isn't supported, but my understanding
of Spark, or of distributed computing in general, is not deep enough to let me
work it out myself, so this really helps!
I ended up going with List[RDD]. The collection of
>>
>> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar wrote:
>>
>>> Simillar question was asked before:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>>>
>>> Here is one of the reasons why I think RDD[RDD[T]] is not pos
n or action APIs of
> RDD), it will be possible to have RDD of RDD.
>
> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar wrote:
>
>> Simillar question was asked before:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>>
>> Here
; http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>
> Here is one of the reasons why I think RDD[RDD[T]] is not possible:
>
>- RDD is only a handle to the actual data partitions. It has a
>reference/pointer to the *SparkContext* object (*sc*) and a li
rk job.
Hope it helps. You need to consider List[RDD] or some other collection.
Possibly in the future, if and when the Spark architecture allows workers to
launch Spark jobs (the functions passed to transformation or action APIs of
RDD), it will be possible to have an RDD of RDDs.
A similar question was asked before:
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
Here is one of the reasons why I think RDD[RDD[T]] is not possible:
- RDD is only a handle to the actual data partitions. It has a
reference/pointer to the *SparkContext* object
Hi,
The problem I am looking at is as follows:
- I read in a log file of multiple users as an RDD
- I'd like to group the above RDD into *multiple RDDs* by userIds (the key)
- my processEachUser() function then takes in each RDD mapped into
each individual user, and calls for RDD.m
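Since an RDD of RDDs isn't possible (see the replies above), a common shape for this problem, sketched under the assumption that the log parses to (userId, line) pairs and that processEachUser and extractUserId are the poster's own (hypothetical) functions:

val byUser = logs.map(line => (extractUserId(line), line))   // key each record by userId
                 .groupByKey()                               // one group per user, inside one RDD
val results = byUser.map { case (userId, lines) => processEachUser(userId, lines) }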
around RDD.
As another interest, I wanted to check whether some of the DF execution functions
can be executed on GPUs. For that to happen, the columnar layout is
important. This is where DF scores over ordinary RDDs.
Seems like the batch size defined by
spark.sql.inMemoryColumnarStorage.batchSize is set to
You may refer to DataFrame Scaladoc
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Methods listed in "Language Integrated Queries" and "RDD Options" can be
viewed as "transformations", and those listed in "Actions" are, of
course, actions. As for SQLCo
I would think DF=RDD+Schema+some additional methods. In fact, a DF object
has a DF.rdd in it so you can (if needed) convert DF<=>RDD really easily.
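A minimal sketch of that round trip, assuming a sqlContext and a small case class of my own invention:

case class Person(name: String, age: Int)
import sqlContext.implicits._

val df  = sc.parallelize(Seq(Person("a", 1), Person("b", 2))).toDF()  // RDD -> DataFrame
val rdd = df.rdd                                                      // DataFrame -> RDD[Row]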
On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar wrote:
> Thanks. Can you point me to a place in the documentation of SQL
> programming guide or DataFr
Thanks. Can you point me to a place in the documentation, the SQL programming
guide, or the DataFrame scaladoc where these transformations and actions are
grouped as they are in the case of RDD?
Also, can you tell me whether sqlContext.load and unionAll are transformations
or actions...
I answered a question on the
For DataFrame, there are also transformations and actions. And
transformations are also lazily evaluated. However, DataFrame
transformations like filter(), select(), agg() return a DataFrame rather
than an RDD. Other methods like show() and collect() are actions.
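To make that concrete, a small hedged example (the DataFrame and column names are made up):

val adults = df.filter(df("age") > 21).select("name")  // transformations: lazy, return DataFrames
adults.show()                                           // action: triggers execution
val rows = adults.collect()                             // action: brings results to the driver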
Cheng
On 6/8/15 1:33 PM, kira
;))
val dt = dataRDD.zipWithUniqueId.map(_.swap)
val newCol1 = dt.map { case (i, x) => (i, x(1) + x(18)) }
val newCol2 = newCol1.join(dt).map(x => function(.))
Hope this helps.
,"))
> val dt = dataRDD.zipWithIndex.map(_.swap)
> val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
> val newCol2 = newCol1.join(dt).map(x=> function(.))
>
> Is there a better way of doing this?
>
> Thank you very much!
>
>
>
>
>
Thanks for replying twice :) I think I sent this question by email and
somehow thought I had not sent it, hence created the other one on the web
interface. Let's retain this thread since you have provided more details
here.
Great, it confirms my intuition about DataFrame. It's similar to Shark
colu
Interesting, just posted on another thread asking exactly the same
question :) My answer there is quoted below:
> For the following code:
>
> val df = sqlContext.parquetFile(path)
>
> `df` remains columnar (actually it just reads from the columnar
Parquet file on disk). For the following code:
ze rows?
n
df.cache().map{row => ...}?
Is it a logical row which maintains an array of columns and each column in
turn is an array of values for batchSize rows?
When Spark reads Parquet files (sqlContext.parquetFile), it creates a
DataFrame RDD. I would like to know if the resulting DataFrame has a columnar
structure (many rows of a column coalesced together in memory) or the row-wise
structure that a Spark RDD has. The section Spark SQL and DataFrames
[]
>
>
> On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen wrote:
>
>> In the sense here, Spark actually does have operations that make multiple
>> RDDs like randomSplit. However there is not an equivalent of the partition
>> operation which gives the elements that matched and d
ordCount.scala:20 []
On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen wrote:
> In the sense here, Spark actually does have operations that make multiple
> RDDs like randomSplit. However there is not an equivalent of the partition
> operation which gives the elements that matched and did not ma
In the sense here, Spark actually does have operations that make multiple
RDDs, like randomSplit. However, there is not an equivalent of the partition
operation, which would give the elements that matched and did not match at once.
On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang wrote:
> As far as I k
As far as I know, Spark doesn't support multiple outputs
On Wed, Jun 3, 2015 at 2:15 PM, ayan guha wrote:
> Why do you need to do that if filter and content of the resulting rdd are
> exactly same? You may as well declare them as 1 RDD.
> On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote:
>
>> I want to
Why do you need to do that if the filter and the content of the resulting RDD are
exactly the same? You may as well declare them as one RDD.
On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote:
> I want to do this
>
> val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId
> != NULL_VALUE)
>
> val
I want to do this
val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId !=
NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId
== NULL_VALUE)
This will run two different stages; can this be done in one stage?
val (qtSessionsWithQt, guidU
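As noted above there is no single-pass split operation, so a common workaround (a sketch only, reusing the names from the question) is to cache the parent so the two filter jobs at least do not recompute it:

rawQtSession.cache()   // the two filters below reuse the cached parent instead of recomputing it
val qtSessionsWithQt   = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)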
Hi,
I'm using persist on different storage levels, but I found no difference on
performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might
be something wrong with my code... So where can I find the persisted RDDs on
disk so that I can make sure they were persisted i
Hi,
I'm using persist with different storage levels, but I found no difference in
performance between MEMORY_ONLY and DISK_ONLY. I think there might
be something wrong with my code... So where can I find the persisted RDDs
on disk so that I can make sure they were indeed persisted?
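A small sketch of how to check this (rdd is a placeholder): disk-persisted blocks end up under the directories configured by spark.local.dir, and the Storage tab of the UI shows what is actually persisted.

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()                          // persistence only happens once the RDD is computed
println(rdd.getStorageLevel)         // confirm the level that was applied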
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile
Thanks
Best Regards
On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> Is it possible to store RDDs as custom output formats, For example ORC?
>
> Thanks,
> Daniel
>
Hi,
Is it possible to store RDDs as custom output formats, For example ORC?
Thanks,
Daniel
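An alternative to saveAsNewAPIHadoopFile, sketched under the assumption of Spark 1.4+ with a HiveContext: convert the RDD to a DataFrame and use the ORC data source (schema and RDD names below are made up).

import hiveContext.implicits._

case class Record(id: Long, value: String)            // made-up schema
val df = myRdd.map(r => Record(r._1, r._2)).toDF()     // myRdd: hypothetical RDD[(Long, String)]
df.write.format("orc").save("/tmp/records_orc")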
What do you mean by "permanently"? If you start up the JDBC server and say
CACHE TABLE it will stay cached as long as the server is running. CACHE
TABLE is idempotent, so you could even just have that command in your BI
tools setup queries.
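For example (the table name is made up), issued either from the BI tool's setup queries or programmatically:

sqlContext.sql("CACHE TABLE my_hive_table")   // idempotent; stays cached while the server runs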
On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam wrote:
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a
Hive table very frequently from the BI tool.
Is there a way to cache the Hive table permanently in SparkSQL? I don't want
to read the Hive table and cache it every time the query is submitted from the
BI tool.
Thanks!
R
case.
If that amount of data is small, you can use rdd.collect and just iterate over
both it and the list to produce the desired result.
use zip
> On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen wrote:
>
>> Hi Mark,
>>
>> That's true, but in neither way can I combine the RDDs, so I have to
>> avoid unions.
>>
>> Thanks,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra
Yang Chen writes:
> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some search and found some suggestions
> that we should try to avoid rdd.union(
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-
Kelvin
On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen wrote:
> Hi Mark,
>
> That's true, but in neither way can I combine the RDDs, so I have to avoid
> unions.
>
> Thanks,
> Yang
>
> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra
> wrote:
>
>> RDD#union is
Hi Mark,
That's true, but either way I cannot combine the RDDs, so I have to avoid
unions.
Thanks,
Yang
On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra
wrote:
> RDD#union is not the same thing as SparkContext#union
>
> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen wrote:
d.union(
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> ).
> I will try to come up with some other ways.
>
> Thank you,
> Yang
>
> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M
> wrote:
>
>> s
Hi Noorul,
Thank you for your suggestion. I tried that, but ran out of memory. I did
some searching and found suggestions
that we should try to avoid rdd.union(
http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
).
I will try
sparkx writes:
> Hi,
>
> I have a Spark job and a dataset of 0.5 Million items. Each item performs
> some sort of computation (joining a shared external dataset, if that does
> matter) and produces an RDD containing 20-500 result items. Now I would like
> to combine all these
Hi,
I have a Spark job and a dataset of 0.5 million items. Each item performs
some sort of computation (joining a shared external dataset, if that
matters) and produces an RDD containing 20-500 result items. Now I would like
to combine all these RDDs and perform the next job. What I have found
"\n") fw.close() } })
Sending from cellphone, not sure how the code snippet will look. :)
On 26 Mar 2015 01:20, "Adrian Mocanu" wrote:
> Hi
>
> Is there a way to write all RDDs in a DStream to the same file?
>
> I tried this and got an empty file. I think
Hi
Is there a way to write all RDDs in a DStream to the same file?
I tried this and got an empty file. I think it's because the file is not closed, i.e.
ESMinibatchFunctions.writer.close() executes before the stream is created.
Here's my code
myStream.foreachRDD(rdd => {
What's the best way to go from:
RDD[(A, B)] to (RDD[A], RDD[B])
If I do:
def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2))
which is the obvious solution, but this runs two maps on the cluster. Can I do
some kind of a fold instead:
def separate[A, B](l: List[(A, B)]) = l.foldLeft(Li
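One common answer (a sketch, not necessarily the best way): cache the parent so the two projections share the upstream computation, even though they remain two passes.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def separate[A: ClassTag, B: ClassTag](k: RDD[(A, B)]): (RDD[A], RDD[B]) = {
  k.cache()                      // both maps reuse the cached parent instead of recomputing it
  (k.map(_._1), k.map(_._2))
}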
For those still interested, I raised this issue on JIRA and received an
official response:
https://issues.apache.org/jira/browse/SPARK-6340
this
> issue whilst experimenting with feature extraction for text classification,
> where (correct me if I'm wrong) there is no built-in mechanism to keep track
> of document-ids through the HashingTF and IDF fitting and transformations.
>
> Thanks.