Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Apologies, I accidentally included the Spark User DL on BCC. The actual email message is below. = Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Rishitesh Mishra
I am not sure if you can view all RDDs in a session. Tables are maintained in a catalogue, hence it's easier. However, you can see the DAG representation, which lists all the RDDs in a job, in the Spark UI. On 20 Aug 2015 22:34, Dhaval Patel dhaval1...@gmail.com wrote: Apologies I

How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that have been created in the current session. Does anyone know if any such command is available? Something similar to SparkSQL's way of listing all temp tables: show
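
For reference, the closest built-in equivalent is the SQLContext catalogue mentioned in the reply above; a minimal sketch (assuming Spark 1.3+ and a SQLContext named sqlContext) of listing what it knows about:

    // Temp tables registered with the current SQLContext
    sqlContext.tableNames().foreach(println)
    sqlContext.tables().show()

    // There is no API that enumerates arbitrary RDDs, but the persisted ones are visible:
    sc.getPersistentRDDs.values.foreach(r => println(s"${r.id} -> ${r.getStorageLevel.description}"))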

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe it could be added in a future release. Thanks Best Regards On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk wrote: Hi, I am currently working on the latest version

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
, Matthew O'Reilly moreill...@qub.ac.uk wrote: Hi, I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache (something similar is Altibase's HDB: http://altibase.com

Encryption on RDDs or in-memory/cache on Apache Spark

2015-07-31 Thread Matthew O'Reilly
Hi, I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache (something similar is Altibase's HDB:  http://altibase.com/in-memory-database-computing-solutions/security

Re: RDDs join problem: incorrect result

2015-07-29 Thread ๏̯͡๏
-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p24049.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-28 Thread Xiangrui Meng
Hi Stahlman, finalRDDStorageLevel is the storage level for the final user/item factors. It is not common to set it to StorageLevel.NONE, unless you want to save the factors directly to disk. So if it is NONE, we cannot unpersist the intermediate RDDs (in/out blocks) because the final user/item
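
A hedged sketch of setting these storage levels explicitly on the MLlib ALS builder (Spark 1.3+; `ratings: RDD[Rating]` is assumed):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.storage.StorageLevel

    // Persist both the intermediate in/out blocks and the final user/item factors,
    // so the intermediate RDDs can be unpersisted once training finishes (a sketch).
    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
      .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
    val model = als.run(ratings)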

Re: Encryption on RDDs or in-memory on Apache Spark

2015-07-27 Thread Akhil Das
, IASIB1 moreill...@qub.ac.uk wrote: I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory (similarly to Altibase's HDB: http://altibase.com/in-memory-database-computing

Encryption on RDDs or in-memory on Apache Spark

2015-07-24 Thread IASIB1
I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory (similarly to Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/ http://altibase.com

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Stahlman, Jonathan
Hi Burak, Looking at the source code, the intermediate RDDs used in ALS.train() are persisted during the computation using intermediateRDDStorageLevel (default value is StorageLevel.MEMORY_AND_DISK) - see herehttps://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Burak Yavuz
at 10:38 AM, Stahlman, Jonathan jonathan.stahl...@capitalone.com wrote: Hello again, In trying to understand the caching of intermediate RDDs by ALS, I looked into the source code and found what may be a bug. Looking here: https://github.com/apache/spark/blob/master/mllib/src/main/scala

How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread swetha
Hi, We have a requirement wherein we need to keep RDDs in memory between Spark batch jobs that run every hour. The idea is to keep RDDs holding active user sessions in memory between two jobs, so that once one job finishes and another runs an hour later, the RDDs

RE: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Ganelin, Ilya
...@capitalone.com] Sent: Wednesday, July 22, 2015 01:42 PM Eastern Standard Time To: user@spark.apache.org Subject: Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel Hello again, In trying to understand the caching of intermediate RDDs by ALS, I looked into the source code

Re: How to share a Map among RDDS?

2015-07-22 Thread Dan Dong
do a join On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote: Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa"->1,"bb"->2,"cc"->3,...) All RDDs will have to check against it to see if the key

Re: How to share a Map among RDDS?

2015-07-22 Thread Dan Dong
a join On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote: Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa"->1,"bb"->2,"cc"->3,...) All RDDs will have to check against it to see if the key is in the Map

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa"->1,"bb"->2,"cc"->3,...) All RDDs will have to check against it to see if the key is in the Map or not, so it seems I have to make the Map itself global; the problem is that if the Map

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Tachyon is one way. Also check out the Spark Job Server https://github.com/spark-jobserver/spark-jobserver . -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs-in-memory-between-two-different-batch-jobs-tp23957p23958.html Sent from

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread harirajaram
I was about to say whatever the previous post said, so +1 to the previous post. From my understanding (gut feeling) of your requirement, it is very easy to do this with spark-job-server. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs-in-memory

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Stahlman, Jonathan
Hello again, In trying to understand the caching of intermediate RDDs by ALS, I looked into the source code and found what may be a bug. Looking here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L230 you see that ALS.train

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but it's not exactly the same as keeping it cached in Spark. Spark Job Server is a way to keep it cached in Spark. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
intensive. -Andrew 2015-07-21 13:47 GMT-07:00 wdbaruni wdbar...@gmail.com: I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or .cache() and can be defined by setting

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
/examples/BroadcastTest.scala . -Andrew 2015-07-21 19:56 GMT-07:00 ayan guha guha.a...@gmail.com: Either you have to do rdd.collect and then broadcast or you can do a join On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote: Hi, All, I am trying to access a Map from RDDs

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread Haoyuan Li
Yes. Tachyon can handle this well: http://tachyon-project.org/ Best, Haoyuan On Wed, Jul 22, 2015 at 10:56 AM, swetha swethakasire...@gmail.com wrote: Hi, We have a requirement wherein we need to keep RDDs in memory between Spark batch processing that happens every one hour. The idea here

Re: How to share a Map among RDDS?

2015-07-21 Thread ayan guha
Either you have to do rdd.collect and then broadcast, or you can do a join On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote: Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa"->1,"bb"->2,"cc"->3
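
A minimal sketch of the broadcast option mentioned above (the RDD and map contents are illustrative):

    // Ship a small lookup map to every executor once, then filter each RDD against it
    val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
    val bcMap = sc.broadcast(map1)
    val filtered = someRdd.filter { case (key, _) => bcMap.value.contains(key) }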

How to share a Map among RDDS?

2015-07-21 Thread Dan Dong
Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa"->1,"bb"->2,"cc"->3,...) All RDDs will have to check against it to see if the key is in the Map or not, so it seems I have to make the Map itself global; the problem

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-21 Thread wdbaruni
I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or .cache() and can be defined by setting spark.storage.memoryFraction (default 0.6) *Shuffle and aggregation buffers
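
For illustration, these fractions are ordinary Spark 1.x configuration properties; a hedged example of setting them when building the context:

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark 1.x memory fractions (later versions replace these with unified memory management)
    val conf = new SparkConf()
      .setAppName("memory-fractions-example")
      .set("spark.storage.memoryFraction", "0.6")   // persisted/cached RDDs
      .set("spark.shuffle.memoryFraction", "0.2")   // shuffle and aggregation buffers
    val sc = new SparkContext(conf)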

Re: Running foreach on a list of rdds in parallel

2015-07-16 Thread Davies Liu
sc.union(rdds).saveAsTextFile() On Wed, Jul 15, 2015 at 10:37 PM, Brandon White bwwintheho...@gmail.com wrote: Hello, I have a list of rdds List(rdd1, rdd2, rdd3,rdd4) I would like to save these rdds in parallel. Right now, it is running each operation sequentially. I tried using a rdd
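
Expanding that one-liner into a hedged sketch (the RDD names and output path are assumptions):

    // Instead of saving each RDD sequentially, union them and write once
    val rdds = List(rdd1, rdd2, rdd3, rdd4)
    sc.union(rdds).saveAsTextFile("/tmp/cache/combined")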

How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-16 Thread Stahlman, Jonathan
. A sample code in python is copied below. The issue I have is that each new model which is trained caches a set of RDDs and eventually the executors run out of memory. Is there any way in Pyspark to unpersist() these RDDs after each iteration? The names of the RDDs which I gather from the UI
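
As a hedged workaround in the Scala API (the question is about PySpark, where the same handles are not directly exposed), the trained model's factor RDDs can be released explicitly, and anything else still persisted on the context can be swept between iterations:

    // Release the factor RDDs of a trained MatrixFactorizationModel
    model.userFeatures.unpersist()
    model.productFeatures.unpersist()

    // Or unpersist everything the SparkContext still tracks (use with care)
    sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))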

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread N B
, is it possible to essentially group all of the RDDs from the batch into a single RDD and single partition and therefore operate on all of the elements in the batch at once? My goal here is to do an operation exactly once for every batch. As I understand it, foreachRDD is going to do

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
, is it possible to essentially group all of the RDDs from the batch into a single RDD and single partition and therefore operate on all of the elements in the batch at once? My goal here is to do an operation exactly once for every batch. As I understand it, foreachRDD is going to do the operation once

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Ted Yu
) -> { log.info("processing RDD from batch {}", batchTime); // my rdd processing code }); Instead of having my rdd processing code called once for each RDD in the batch, is it possible to essentially group all of the RDDs from

Running foreach on a list of rdds in parallel

2015-07-15 Thread Brandon White
Hello, I have a list of RDDs, List(rdd1, rdd2, rdd3, rdd4), and I would like to save these RDDs in parallel. Right now, it is running each operation sequentially. I tried using an RDD of RDDs but that does not work. list.foreach { rdd => rdd.saveAsTextFile("/tmp/cache/") } Any ideas?

Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
rdd processing code called once for each RDD in the batch, is it possible to essentially group all of the RDDs from the batch into a single RDD and single partition and therefore operate on all of the elements in the batch at once? My goal here is to do an operation exactly once for every batch
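
A hedged sketch of one way to act on the whole batch at once (each batch already arrives as a single RDD, so collapsing it to one partition lets a single task see every element; `stream` is an assumed DStream):

    stream.foreachRDD { (rdd, batchTime) =>
      // One RDD per batch; coalesce to a single partition so one task handles the whole batch
      rdd.coalesce(1).foreachPartition { elements =>
        elements.foreach(e => println(s"$batchTime: $e"))  // batch-wide processing goes here
      }
    }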

Re: correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread DW @ Gmail
this working? I'm using sbt assembly to try to compile these files, and would really appreciate any help. Thanks, Ashley Wang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/correct-Scala-Imports-for-creating-DFs-from-RDDs-tp23829.html Sent from

correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread ashwang168
-Imports-for-creating-DFs-from-RDDs-tp23829.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
will be started. Yes, one RDD per batch per DStream. However, the RDD could be a union of multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned DStream). TD On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia mici...@gmail.com wrote: Thanks Tathagata! I will use *foreachRDD

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Khaled Hammouda
of multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned DStream). TD On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia mici...@gmail.com wrote: Thanks Tathagata! I will use *foreachRDD*/*foreachPartition*() instead of *trasform*() then. Does the default scheduler initiate

Re: Are Spark Streaming RDDs always processed in order?

2015-07-04 Thread Michal Čizmazia
RDD per batch per DStream. However, the RDD could be a union of multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned DStream). TD On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia mici...@gmail.com wrote: Thanks Tathagata! I will use *foreachRDD*/*foreachPartition*() instead

Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished? This is crucial to the ack logic, since if RDD2 can be potentially processed while RDD1

Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch. Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2

Re: Union of many RDDs taking a long time

2015-06-29 Thread Tomasz Fruboes
SparkContext once, as soon as you have all RDDs ready. For python it looks this way: rdds = [] for i in xrange(cnt): rdd = ... rdds.append(rdd) finalRDD = sparkContext.union(rdds) HTH, Tomasz W dniu 18.06.2015 o 02:53, Matt Forbes pisze: I have multiple input paths which
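
The same build-then-union-once idea as a hedged Scala sketch (the per-iteration source is a hypothetical placeholder):

    val cnt = 10  // number of inputs (assumed)
    // Collect the RDDs first, then union them in a single SparkContext call
    val rdds = (0 until cnt).map(i => sc.textFile(s"/input/path/part-$i"))
    val finalRdd = sc.union(rdds)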

Union of many RDDs taking a long time

2015-06-17 Thread Matt Forbes
) ? nextRdd : rdd.union(nextRdd); rdd = rdd.coalesce(nextRdd.partitions().size()); } Now, for a small number of inputs there doesn't seem to be a problem, but for the full set which is about 60 sub-RDDs coming in at around 500MM total records takes a very long time to construct. Just for a simple load

Re: RDD of RDDs

2015-06-10 Thread ping yan
Thanks much for the detailed explanations. I suspected it was an architectural limitation around the notion of an RDD of RDDs, but my understanding of Spark, or of distributed computing in general, is not deep enough to have worked that out myself, so this really helps! I ended up going with List[RDD]. The collection

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD[RDD[T]] is not possible: - RDD is only a handle to the actual data partitions. It has a reference/pointer to the *SparkContext* object (*sc*) and a list

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
A similar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD[RDD[T]] is not possible: - RDD is only a handle to the actual data partitions. It has a reference/pointer to the *SparkContext* object

Re: Rdd of Rdds

2015-06-09 Thread lonikar
, if and when spark architecture allows workers to launch spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have RDD of RDD. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-tp17025p23217.html Sent

Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
or action APIs of RDD), it will be possible to have RDD of RDD. On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote: Simillar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
, kiran lonikar loni...@gmail.com wrote: A similar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD[RDD[T]] is not possible: - RDD is only a handle to the actual data partitions. It has

RDD of RDDs

2015-06-08 Thread ping yan
Hi, The problem I am looking at is as follows: - I read in a log file of multiple users as an RDD - I'd like to group the above RDD into *multiple RDDs* by userIds (the key) - my processEachUser() function then takes in each RDD mapped to an individual user, and calls RDD.map
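
The List[RDD] approach the poster eventually settled on (see the follow-up earlier in this thread) can be sketched roughly as follows, hedged, with the RDD and helper names assumed from the description:

    // One filtered RDD per user; workable for a modest number of users,
    // but each filter re-scans the parent, so cache the parent first
    logRdd.cache()
    val userIds = logRdd.map(_._1).distinct().collect()
    val perUserRdds = userIds.map(id => logRdd.filter(_._1 == id))
    perUserRdds.foreach(processEachUser)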

Re: How does lineage get passed down in RDDs

2015-06-08 Thread maxdml
://apache-spark-user-list.1001560.n3.nabble.com/How-does-lineage-get-passed-down-in-RDDs-tp23196p23212.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Column operation on Spark RDDs.

2015-06-08 Thread lonikar
.map(_.swap) val newCol1 = dt.map { case (i, x) => (i, x(1)+x(18)) } val newCol2 = newCol1.join(dt).map(x => function(...)) Hope this helps. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165p23203.html Sent from

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
For DataFrame, there are also transformations and actions. And transformations are also lazily evaluated. However, DataFrame transformations like filter(), select(), agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions. Cheng On 6/8/15 1:33 PM,

Re: Column operation on Spark RDDs.

2015-06-08 Thread kiran lonikar
newCol2 = newCol1.join(dt).map(x => function(...)) Is there a better way of doing this? Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html Sent from the Apache Spark User List mailing

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread ayan guha
I would think DF = RDD + schema + some additional methods. In fact, a DF object has a DF.rdd in it, so you can (if needed) convert a DF to an RDD really easily. On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar loni...@gmail.com wrote: Thanks. Can you point me to a place in the documentation of SQL programming
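
A small illustration of that conversion in both directions (Spark 1.3+; the RDD and case class are assumptions):

    // DataFrame -> RDD[Row] is just a field access
    val rowRdd = df.rdd

    // RDD -> DataFrame needs a schema, e.g. via a case class and the implicits
    case class Person(name: String, age: Int)
    import sqlContext.implicits._
    val df2 = peopleRdd.map(p => Person(p._1, p._2)).toDF()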

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
Thanks. Can you point me to a place in the documentation of SQL programming guide or DataFrame scaladoc where this transformation and actions are grouped like in the case of RDD? Also if you can tell me if sqlContext.load and unionAll are transformations or actions... I answered a question on

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
You may refer to the DataFrame Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame Methods listed under Language Integrated Queries and RDD Operations can be viewed as transformations, and those listed under Actions are, of course, actions. As for

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
. As another interest, I wanted to check if some of the DF execution functions can be executed on GPUs. For that to happen, the columnar layout is important. Here is where DF scores over ordinary RDDs. It seems the batch size defined by spark.sql.inMemoryColumnarStorage.batchSize is set to a default size

Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian
this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when-creating-RDDs-from-Parquet-or-ORC-files-tp23139.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I had not sent it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark

Column operation on Spark RDDs.

2015-06-04 Thread Carter
-on-Spark-RDDs-tp23165.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
As far as I know, Spark doesn't support multiple outputs On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote: Why do you need to do that if the filter and content of the resulting RDD are exactly the same? You may as well declare them as 1 RDD. On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
at 2:45 PM, Sean Owen so...@cloudera.com wrote: In the sense here, Spark actually does have operations that make multiple RDDs like randomSplit. However there is not an equivalent of the partition operation which gives the elements that matched and did not match at once. On Wed, Jun 3, 2015, 8

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Sean Owen
In the sense here, Spark actually does have operations that make multiple RDDs like randomSplit. However there is not an equivalent of the partition operation which gives the elements that matched and did not match at once. On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote: As far
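
For reference, a hedged sketch of both points: randomSplit returns several RDDs from one call, while the matched/unmatched split still costs two filters:

    // One call, several RDDs
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // No built-in "partition" operation: matched and unmatched are still two filters
    val matched   = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
    val unmatched = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)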

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread ayan guha
Why do you need to do that if the filter and content of the resulting RDD are exactly the same? You may as well declare them as 1 RDD. On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I want to do this val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId !=

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When Spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or if it is a row-wise structure like a normal Spark RDD. The section Spark SQL and DataFrames

Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-03 Thread lonikar
= ...}? Is it a logical row which maintains an array of columns and each column in turn is an array of values for batchSize rows? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when-creating-RDDs-from-Parquet

Filter operation to return two RDDs at once.

2015-06-02 Thread ๏̯͡๏
I want to do this: val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE) val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE) This will run two different stages; can this be done in one stage? val (qtSessionsWithQt,

Where does Spark persist RDDs on disk?

2015-05-05 Thread Haoliang Quan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed? Thank
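
A hedged sketch of verifying that an RDD really was persisted (the Storage tab of the Spark UI shows the same information; on disk the blocks land under the executors' spark.local.dir directories):

    import org.apache.spark.storage.StorageLevel

    val cached = data.persist(StorageLevel.DISK_ONLY)
    cached.count()  // persistence only takes effect once an action materializes the RDD

    // What the context currently tracks as persisted, and at which level
    sc.getPersistentRDDs.values.foreach(r => println(s"${r.id} -> ${r.getStorageLevel.description}"))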

Where does Spark persist RDDs on disk?

2015-05-05 Thread hquan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed

Re: Saving RDDs as custom output format

2015-04-15 Thread Akhil Das
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile Thanks Best Regards On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

Saving RDDs as custom output format

2015-04-14 Thread Daniel Haviv
Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

Re: SparkSQL - Caching RDDs

2015-04-01 Thread Michael Armbrust
What do you mean by permanently? If you start up the JDBC server and say CACHE TABLE, it will stay cached as long as the server is running. CACHE TABLE is idempotent, so you could even just have that command in your BI tool's setup queries. On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam
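
For illustration, the corresponding statements as they could be issued from a HiveContext (or through the Thrift JDBC server); the table name is an assumption:

    sqlContext.sql("CACHE TABLE my_hive_table")    // stays cached while the server runs
    // ... BI queries against my_hive_table now hit the in-memory columnar cache ...
    sqlContext.sql("UNCACHE TABLE my_hive_table")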

SparkSQL - Caching RDDs

2015-04-01 Thread Venkat, Ankam
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool. Is there a way to cache the Hive Table permanently in SparkSQL? I don't want to read the Hive table and cache it everytime the query is submitted from BI tool. Thanks!

Re: can't union two rdds

2015-03-31 Thread ankurjain.nitrr
case. If the amount of data is small, you can use rdd.collect, then just iterate over both lists and produce the desired result -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22323.html Sent from the Apache Spark User List mailing

Re: can't union two rdds

2015-03-31 Thread roy
use zip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22321.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
, That's true, but in neither way can I combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com wrote: RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
Yang Chen y...@yang-cs.com writes: Hi Noorul, Thank you for your suggestion. I tried that, but ran out of memory. I did some search and found some suggestions that we should try to avoid rdd.union( http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
sparkx y...@yang-cs.com writes: Hi, I have a Spark job and a dataset of 0.5 Million items. Each item performs some sort of computation (joining a shared external dataset, if that does matter) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs

Combining Many RDDs

2015-03-26 Thread sparkx
Hi, I have a Spark job and a dataset of 0.5 Million items. Each item performs some sort of computation (joining a shared external dataset, if that does matter) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs and perform a next job. What I have found

Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen y...@yang-cs.com wrote: Hi Mark, That's true, but in neither way can I combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com wrote: RDD#union is not the same thing

Re: Combining Many RDDs

2015-03-26 Thread Mark Hamstra
://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark ). I will try to come up with some other ways. Thank you, Yang On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com wrote: sparkx y...@yang-cs.com writes: Hi

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Mark, That's true, but in neither way can I combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com wrote: RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang

Re: writing DStream RDDs to the same file

2015-03-26 Thread Akhil Das
, not sure how the code snippet will look. :) On 26 Mar 2015 01:20, Adrian Mocanu amoc...@verticalscope.com wrote: Hi Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it’s bc the file is not closed i.e

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code myStream.foreachRDD(rdd => { rdd.foreach(x

RDD pair to pair of RDDs

2015-03-18 Thread Alex Turner (TMS)
What's the best way to go from: RDD[(A, B)] to (RDD[A], RDD[B]) If I do: def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2)) Which is the obvious solution, this runs two maps in the cluster. Can I do some kind of a fold instead: def separate[A, B](l: List[(A, B)]) =

Re: order preservation with RDDs

2015-03-16 Thread kian.ho
For those still interested, I raised this issue on JIRA and received an official response: https://issues.apache.org/jira/browse/SPARK-6340 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052p22088.html Sent from the Apache

Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
, where (correct me if I'm wrong) there is no built-in mechanism to keep track of document-ids through the HashingTF and IDF fitting and transformations. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html

order preservation with RDDs

2015-03-14 Thread kian.ho
-with-RDDs-tp22052.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-13 Thread shahab
this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs with union, since that doesn't cause a shuffle. On Thu, Mar 12, 2015 at 11:04 AM, shahab shahab.mok...@gmail.com wrote: Hi

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs

Re: Is there a limit to the number of RDDs in a Spark context?

2015-03-12 Thread Juan Rodríguez Hortalá
Hi, It's been some time since my last message on the subject of using many RDDs in a Spark job, but I have just encountered the same problem again. The thing is that I have an RDD of time-tagged data that I want to 1) divide into windows according to a timestamp field; 2) compute KMeans

Re: RDDs

2015-03-03 Thread Manas Kar
=> line.contains("b")) println("Lines with b: %s".format(numBs.count)) } } }) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p21892.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: RDDs

2015-03-03 Thread Kartheek.R
=> line.contains("b")) println("Lines with b: %s".format(numBs.count)) } } }) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p21892.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Columnar-Oriented RDDs

2015-03-01 Thread Night Wolf
wrote: Hi all, I'd like to build/use column-oriented RDDs in some of my Spark code. A normal Spark RDD is stored as a row-oriented object, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark (used

Re: Columnar-Oriented RDDs

2015-03-01 Thread Koert Kuipers
SQL and is used by default when you run .cache on a SchemaRDD or CACHE TABLE. I'd also look at parquet which is more efficient and handles nested data better. On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com wrote: Hi all, I'd like to build/use column oriented RDDs

Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
As you suggested, I tried to save the grouped RDD and persisted it in memory before the iterations begin. The performance seems to be much better now. My previous comment that the run times doubled was from a wrong observation. Thanks. On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan

Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
Thanks. I tried persist() on the RDD. The runtimes appear to have doubled now (without persist() it was ~7s per iteration and now it's ~15s). I am running standalone Spark on an 8-core machine. Any thoughts on why the increase in runtime? On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid
