Apologies, I accidentally included the Spark User DL on BCC. The actual email message is below.

Hi,
I have been working on a few examples using Zeppelin, and I have been trying to find a command that would list all *dataframes/RDDs* that have been created in the current session. Does anyone know if any such command is available? Something similar to Spark SQL's "show tables" for listing all temp tables.
I am not sure you can view all RDDs in a session. Tables are maintained in a catalogue, hence they are easier to list. However, you can see the DAG representation, which lists all the RDDs in a job, in the Spark UI.
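For the temp-table half of the question, a minimal sketch, assuming a Spark 1.x sqlContext is already available (there is no equivalent listing for raw RDDs, as noted above):

// All table names registered in this context's catalogue:
sqlContext.tableNames().foreach(println)
// Or as a DataFrame with tableName / isTemporary columns:
sqlContext.tables().show()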
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe it could be added in a future release.
Thanks
Best Regards
On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk wrote:
Hi,
I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+.
Is there any feature in Spark/Hadoop to encrypt RDDs or the in-memory cache (something similar to Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/)?
Hi Stahlman,
finalRDDStorageLevel is the storage level for the final user/item factors. It is not common to set it to StorageLevel.NONE, unless you want to save the factors directly to disk. So if it is NONE, we cannot unpersist the intermediate RDDs (in/out blocks) because the final user/item factors have not been materialized yet.
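For reference, a minimal sketch of where those two storage levels are set on the mllib ALS builder (the rank/iteration values are illustrative only, and ratings stands in for an RDD[Rating] prepared elsewhere):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK) // in/out blocks
  .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)        // final user/item factors
  .run(ratings)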
Hi Burak,
Looking at the source code, the intermediate RDDs used in ALS.train() are persisted during the computation using intermediateRDDStorageLevel (the default value is StorageLevel.MEMORY_AND_DISK) - see here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml

at 10:38 AM, Stahlman, Jonathan jonathan.stahl...@capitalone.com wrote:
Hello again,
In trying to understand the caching of intermediate RDDs by ALS, I looked into the source code and found what may be a bug.
Hi,
We have a requirement wherein we need to keep RDDs in memory between Spark batch jobs that run every hour. The idea here is to have RDDs that hold the active user sessions in memory between two jobs, so that once one job finishes and another runs an hour later, the RDDs can be reused.
Either you have to do rdd.collect and then broadcast, or you can do a join.
On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote:
Hi, All,
I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like:
val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3, ...)
All RDDs will have to check against it to see if the key is in the Map or not, so it seems I have to make the Map itself global; the problem is that if the Map
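A minimal sketch of the broadcast approach (sc is the SparkContext, and keysRdd stands in for an RDD of keys; both names are placeholders):

val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
val bcMap = sc.broadcast(map1) // one read-only copy shipped to each executor
val matched = keysRdd.filter(key => bcMap.value.contains(key))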
Tachyon is one way. Also check out the Spark Job Server
https://github.com/spark-jobserver/spark-jobserver .
I was about to say whatever the previous post said, so +1 to the previous post. From my understanding (gut feeling) of your requirement, it is very easy to do this with spark-job-server.
Hello again,
In trying to understand the caching of intermediate RDDs by ALS, I looked into
the source code and found what may be a bug. Looking here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L230
you see that ALS.train
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but
it's not exactly the same as keeping it cached in Spark. Spark Job Server
is a way to keep it cached in Spark.
There is also an example at /examples/BroadcastTest.scala.
-Andrew
Yes. Tachyon can handle this well: http://tachyon-project.org/
Best,
Haoyuan
On Wed, Jul 22, 2015 at 10:56 AM, swetha swethakasire...@gmail.com wrote:
Hi,
We have a requirement wherein we need to keep RDDs in memory between Spark batch jobs that run every hour.
I am new to Spark and I understand that Spark divides the executor memory into the following fractions:
*RDD Storage:* which Spark uses to store persisted RDDs via .persist() or .cache(), and which can be configured by setting spark.storage.memoryFraction (default 0.6)
*Shuffle and aggregation buffers*
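These fractions are set on the SparkConf; a minimal sketch for Spark 1.x (the 0.4 value is purely illustrative, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-fractions")
  .set("spark.storage.memoryFraction", "0.4")  // persisted RDDs (default 0.6)
  .set("spark.shuffle.memoryFraction", "0.2")  // shuffle/aggregation buffers (default 0.2)
val sc = new SparkContext(conf)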
sc.union(rdds).saveAsTextFile()
On Wed, Jul 15, 2015 at 10:37 PM, Brandon White bwwintheho...@gmail.com wrote:
Hello,
I have a list of rdds: List(rdd1, rdd2, rdd3, rdd4). I would like to save these rdds in parallel.
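The union approach above produces a single job and a single output. If separate outputs are wanted, one alternative sketch is to submit each save from its own thread, so the scheduler can run the jobs concurrently (rdds: Seq[RDD[String]] and the paths are placeholders):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val saves = rdds.zipWithIndex.map { case (rdd, i) =>
  Future(rdd.saveAsTextFile(s"/tmp/cache/out-$i")) // each save is submitted as its own job
}
saves.foreach(Await.result(_, Duration.Inf))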
A sample code in Python is copied below. The issue I have is that each new model which is trained caches a set of RDDs, and eventually the executors run out of memory. Is there any way in PySpark to unpersist() these RDDs after each iteration? The names of the RDDs, which I gather from the UI
myDStream.foreachRDD((rdd, batchTime) -> {
  log.info("processing RDD from batch {}", batchTime);
  // my rdd processing code
});
Instead of having my rdd processing code called once for each RDD in the batch, is it possible to essentially group all of the RDDs from the batch into a single RDD and a single partition, and therefore operate on all of the elements in the batch at once?
My goal here is to do an operation exactly once for every batch. As I understand it, foreachRDD is going to do the operation once for each RDD in the batch.
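Since each batch of a DStream is a single RDD (see TD's reply below), a sketch of doing the per-batch work in one task (stream is a placeholder DStream; written in Scala):

stream.foreachRDD { rdd =>
  rdd.coalesce(1).foreachPartition { elems =>
    elems.foreach(println) // placeholder for the real per-batch processing
  }
}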
Hello,
I have a list of rdds:
List(rdd1, rdd2, rdd3, rdd4)
I would like to save these rdds in parallel. Right now, it is running each operation sequentially. I tried using an rdd of rdds, but that does not work.
list.foreach { rdd =>
  rdd.saveAsTextFile("/tmp/cache/")
}
Any ideas?
How do I get this working?
I'm using sbt assembly to try to compile these files, and would really appreciate any help.
Thanks,
Ashley Wang
Yes, one RDD per batch per DStream. However, the RDD could be a union of multiple RDDs (e.g. RDDs generated by a windowed DStream, or a unioned DStream).
TD
On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia mici...@gmail.com wrote:
Thanks Tathagata!
I will use *foreachRDD*/*foreachPartition*() instead of *transform*() then.
Does the default scheduler initiate
The idea is to ack a whole batch of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch, and that would ack the entire batch. Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished? This is crucial to the ack logic, since if RDD2 can potentially be processed while RDD1
Call union on the SparkContext once, as soon as you have all RDDs ready. For Python it looks this way:
rdds = []
for i in xrange(cnt):
    rdd = ...
    rdds.append(rdd)
finalRDD = sparkContext.union(rdds)
HTH,
Tomasz
On 18.06.2015 at 02:53, Matt Forbes wrote:
I have multiple input paths which
rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
rdd = rdd.coalesce(nextRdd.partitions().size());
}
Now, for a small number of inputs there doesn't seem to be a problem, but the full set, which is about 60 sub-RDDs coming in at around 500MM total records, takes a very long time to construct. Just for a simple
Thanks much for the detailed explanations. I suspected the architecture did not support the notion of an RDD of RDDs, but my understanding of Spark, or of distributed computing in general, is not deep enough to let me see why, so this really helps!
I ended up going with List[RDD]. The collection
A similar question was asked before:
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
Here is one of the reasons why I think RDD[RDD[T]] is not possible:
- RDD is only a handle to the actual data partitions. It has a reference/pointer to the *SparkContext* object (*sc*) and a list of its partitions.
If and when the Spark architecture allows workers to launch Spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have an RDD of RDDs.
Hi,
The problem I am looking at is as follows:
- I read in a log file of multiple users as an RDD.
- I'd like to group the above RDD into *multiple RDDs* by userId (the key).
- My processEachUser() function then takes in each RDD mapped to an individual user, and calls RDD.map (see the sketch below).
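Given that RDD[RDD[T]] is not possible (see above), a sketch along the List[RDD] lines mentioned earlier: filter a cached parent once per user. All names and the log format are placeholders, and this is only reasonable for a modest number of users, since it makes one pass over the data per user:

import org.apache.spark.rdd.RDD

val logRdd = sc.textFile("/logs").map(line => (line.split(" ")(0), line)) // (userId, line)
logRdd.cache() // avoid re-reading the log for every user
val userIds = logRdd.keys.distinct().collect()
val perUser: List[RDD[(String, String)]] =
  userIds.map(id => logRdd.filter(_._1 == id)).toList
perUser.foreach(processEachUser) // processEachUser is the poster's own function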
.map(_.swap)
val newCol1 = dt.map { case (i, x) => (i, x(1) + x(18)) }
val newCol2 = newCol1.join(dt).map(x => function(...))
Hope this helps.
For DataFrame, there are also transformations and actions, and transformations are also lazily evaluated. However, DataFrame transformations like filter(), select(), and agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions.
Cheng
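A minimal sketch of the distinction (df is a placeholder DataFrame assumed to have name and age columns):

val adults = df.filter(df("age") >= 18).select("name", "age") // transformations: lazy, return DataFrames
adults.show()               // action: runs the query and prints rows
val rows = adults.collect() // action: materializes Rows on the driver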
newCol2 = newCol1.join(dt).map(x => function(...))
Is there a better way of doing this?
Thank you very much!
I would think DF = RDD + Schema + some additional methods. In fact, a DF object has a DF.rdd in it, so you can (if needed) convert a DF to an RDD really easily.
On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar loni...@gmail.com wrote:
Thanks. Can you point me to a place in the documentation of the SQL programming guide or the DataFrame scaladoc where these transformations and actions are grouped like in the case of RDD?
Also, can you tell me whether sqlContext.load and unionAll are transformations or actions?
I answered a question on
You may refer to the DataFrame scaladoc:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Methods listed in "Language Integrated Queries" and "RDD Operations" can be viewed as transformations, and those listed in "Actions" are, of course, actions. As for
As another interest, I wanted to check if some of the DF execution functions can be executed on GPUs. For that to happen, the columnar layout is important. Here is where DF scores over ordinary RDDs.
It seems like the batch size defined by spark.sql.inMemoryColumnarStorage.batchSize is set to a default size
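That batch size can be tuned through the SQL configuration; a one-line sketch (the value shown is illustrative):

sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")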
Interesting, you just posted on another thread asking exactly the same question :) My answer there is quoted below:
For the following code:
val df = sqlContext.parquetFile(path)
`df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:
Thanks for replying twice :) I think I sent this question by email and somehow thought I had not sent it, hence I created the other one on the web interface. Let's retain this thread, since you have provided more details here.
Great, it confirms my intuition about DataFrame. It's similar to Shark
As far as I know, Spark doesn't support multiple outputs.
On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote:
Why do you need to do that if the filter and the content of the resulting rdd are exactly the same? You may as well declare them as 1 RDD.
On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote:
In the sense here, Spark actually does have operations that make multiple RDDs, like randomSplit. However, there is not an equivalent of the partition operation which gives the elements that matched and did not match at once.
On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote:
As far as I know, Spark doesn't support multiple outputs.
When Spark reads Parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know whether the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or the row-wise structure of a regular Spark RDD. The section "Spark SQL and DataFrames"
Is it a logical row which maintains an array of columns, where each column in turn is an array of values for batchSize rows?
I want to do this:
val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)
This will run two different stages. Can this be done in one stage, something like
val (qtSessionsWithQt,
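Per Sean's reply above, there is no single-pass partition-style operation, but a common mitigation sketch is to cache the parent so the second filter does not recompute it:

rawQtSession.cache() // parent computed once; both filters then read the cached blocks
val qtSessionsWithQt   = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)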
Hi,
I'm using persist with different storage levels, but I found no difference in performance between MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk, so that I can make sure they were indeed persisted?
Thanks
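One thing worth checking with a sketch like the following: persistence only takes effect after the first action materializes the RDD, and on-disk blocks are written under spark.local.dir (default /tmp):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000).persist(StorageLevel.DISK_ONLY)
data.count()                  // first action: blocks are now written under spark.local.dir
println(data.getStorageLevel) // confirms the level actually applied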
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile.
Thanks
Best Regards
On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote:
Hi,
Is it possible to store RDDs as custom output formats, for example ORC?
Thanks,
Daniel
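A generic sketch of the saveAsNewAPIHadoopFile call shape (TextOutputFormat is used here only to keep the example self-contained; the ORC case would substitute Hive's ORC output format and its key/value types; myRdd and the path are placeholders):

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val pairs = myRdd.map(x => (NullWritable.get(), new Text(x.toString)))
pairs.saveAsNewAPIHadoopFile(
  "/tmp/out",
  classOf[NullWritable],
  classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]])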
What do you mean by permanently? If you start up the JDBC server and say CACHE TABLE, it will stay cached as long as the server is running. CACHE TABLE is idempotent, so you could even just have that command in your BI tool's setup queries.
On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam wrote:
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool.
Is there a way to cache the Hive table permanently in SparkSQL? I don't want to read the Hive table and cache it every time a query is submitted from the BI tool.
Thanks!
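A sketch of the setup-query idea from a HiveContext (the table name is made up; against the JDBC server the same statements would be issued as plain SQL):

sqlContext.sql("CACHE TABLE my_hive_table") // idempotent; stays cached while the server runs
sqlContext.sql("SELECT count(*) FROM my_hive_table").show()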
If the amount of data is small, you can use rdd.collect, then just iterate over both it and the list and produce the desired result.
Use zip.
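For reference, a sketch of the zip suggestion (rdd1 and rdd2 are placeholders; zip requires both RDDs to have the same number of partitions and the same number of elements per partition):

val zipped = rdd1.zip(rdd2) // RDD[(A, B)], element-wise pairing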
Hi Mark,
That's true, but in neither way can I combine the RDDs, so I have to avoid unions.
Thanks,
Yang
On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com wrote:
RDD#union is not the same thing as SparkContext#union.
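A sketch of the difference Mark is pointing at (sc and rddList, a Seq of RDDs, are placeholders):

val chained = rddList.reduce(_ union _) // RDD#union pairwise: builds a deeply nested lineage
val flat    = sc.union(rddList)         // SparkContext#union: a single UnionRDD over all inputs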
Yang Chen y...@yang-cs.com writes:
Hi Noorul,
Thank you for your suggestion. I tried that, but ran out of memory. I did some searching and found suggestions that we should try to avoid rdd.union (
http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
I will try to come up with some other ways.
Thank you,
Yang
On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com wrote:
sparkx y...@yang-cs.com writes:
Hi,
I have a Spark job and a dataset of 0.5 million items. Each item performs some sort of computation (joining a shared external dataset, if that matters) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs and perform the next job. What I have found
I am not sure how the code snippet will look. :)
On 26 Mar 2015 01:20, Adrian Mocanu amoc...@verticalscope.com wrote:
Hi
Is there a way to write all RDDs in a DStream to the same file?
I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code:
myStream.foreachRDD(rdd => {
  rdd.foreach(x
What's the best way to go from:
RDD[(A, B)] to (RDD[A], RDD[B])
If I do:
def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2))
which is the obvious solution, this runs two maps in the cluster. Can I do some kind of a fold instead:
def separate[A, B](l: List[(A, B)]) =
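A fold cannot really help here, since an RDD fold is an action that returns a single value rather than two RDDs. Caching the parent at least keeps the two maps from recomputing it; an untested sketch (the ClassTag bounds are needed for map to compile):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def separate[A: ClassTag, B: ClassTag](k: RDD[(A, B)]): (RDD[A], RDD[B]) = {
  k.cache() // both projections reuse the cached pairs
  (k.map(_._1), k.map(_._2))
}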
For those still interested, I raised this issue on JIRA and received an
official response:
https://issues.apache.org/jira/browse/SPARK-6340
...where (correct me if I'm wrong) there is no built-in mechanism to keep track of document IDs through the HashingTF and IDF fitting and transformations.
Thanks.
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network.
Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs with union, since that doesn't cause a shuffle.
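A sketch of the filter-before-join advice (left and right are placeholder pair RDDs, keep() a placeholder predicate):

val smallerLeft = left.filter { case (_, v) => keep(v) }
val joined = smallerLeft.join(right) // the join's shuffle now moves less data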
Hi,
It's been some time since my last message on the subject of using many RDDs in a Spark job, but I have just encountered the same problem again. The thing is that I have an RDD of time-tagged data that I want to 1) divide into windows according to a timestamp field, and 2) compute KMeans
=> line.contains("b"))
println("Lines with b: %s".format(numBs.count))
}
}
})
Hi all,
I'd like to build/use column-oriented RDDs in some of my Spark code. A normal Spark RDD is stored as a row-oriented object, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark (used

The columnar format is part of Spark SQL and is used by default when you run .cache on a SchemaRDD or CACHE TABLE. I'd also look at Parquet, which is more efficient and handles nested data better.
As you suggested, I tried saving the grouped RDD and persisting it in memory before the iterations begin. The performance seems to be much better now.
My previous comment that the run times doubled came from a wrong observation.
Thanks.
On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan wrote:
Thanks.
I tried persist() on the RDD. The runtimes appear to have doubled now (without persist() it was ~7s per iteration, and now it's ~15s). I am running standalone Spark on an 8-core machine.
Any thoughts on why the increase in runtime?
On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid wrote: