Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Artemis User
age-for-spark-rdds> Regards, - Abhishek Shakya Senior Data Scientist 1, Contact: +919002319890 | Email ID: abhishek.sha...@aganitha.ai Aganitha Cognitive Solutions <https://aganitha.ai/>

Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Sean Owen
spark-rapids is not part of Spark, so couldn't speak to it, but Spark itself does not use GPUs at all. It does let you configure a task to request a certain number of GPUs, and that would work for RDDs, but it's up to the code being executed to use the GPUs. On Tue, Sep 21, 2021 at 1:23 PM
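For reference, a minimal sketch (untested) of what Sean describes: Spark 3's resource scheduling only allocates GPUs to tasks, and the task code itself has to use them. The discovery-script path and the amounts below are placeholder assumptions.

    import org.apache.spark.TaskContext
    import org.apache.spark.sql.SparkSession

    // Ask Spark to allocate one GPU per executor and per task; the script path is illustrative.
    val spark = SparkSession.builder()
      .appName("gpu-rdd-sketch")
      .config("spark.executor.resource.gpu.amount", "1")
      .config("spark.task.resource.gpu.amount", "1")
      .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpusResources.sh")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 100, 4)
    val withGpu = rdd.mapPartitions { iter =>
      // Spark only hands the task its GPU addresses; actually using the GPU is up to this code.
      val gpus = TaskContext.get().resources()("gpu").addresses
      iter.map(x => (gpus.mkString(","), x))
    }
    withGpu.collect().take(3).foreach(println)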

Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Abhishek Shakya
usage for RDD interfaces? PS: The question is posted in stackoverflow as well: Link <https://stackoverflow.com/questions/69273205/does-apache-spark-3-support-gpu-usage-for-spark-rdds> Regards, - Abhishek Shakya Senior Data Scientist 1, Contact: +919002319890 | Em

Out of scope RDDs not getting cleaned up

2020-08-18 Thread jainbhavya53
Hi, I am using Spark 2.1 and I am leveraging Spark Streaming for my data pipeline. Now, in my case the batch size is 3 minutes, and we persist a couple of RDDs while processing a batch; after processing we rely on Spark's ContextCleaner to clean out RDDs which are no longer in scope. So we have
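One option, rather than relying on ContextCleaner, is to unpersist each batch's RDDs explicitly once the batch is done. A minimal sketch, with a placeholder source, host and port:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("cleanup-sketch"), Seconds(180))
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    lines.foreachRDD { rdd =>
      val words = rdd.flatMap(_.split("\\s+")).persist()
      val total = words.count()                 // first use
      val distinct = words.distinct().count()   // second use
      println(s"batch: $total words, $distinct distinct")
      words.unpersist()                         // release the blocks now instead of waiting for ContextCleaner
    }
    ssc.start()
    ssc.awaitTermination()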

Async API to save RDDs?

2020-08-05 Thread Antonin Delpeuch (lists)
Hi, The RDD API provides async variants of a few RDD methods, which let the user execute the corresponding jobs asynchronously. This makes it possible to cancel the jobs for instance: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/AsyncRDDActions.html There does not seem to
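For readers, a minimal sketch of these async actions; the FutureAction they return is a standard Scala Future that can also be cancelled:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.Await
    import scala.concurrent.duration._

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("async-sketch"))
    val rdd = sc.parallelize(1 to 1000000, 8)

    val pending = rdd.map(_ * 2).countAsync()   // returns immediately; the job runs in the background
    // pending.cancel()                         // the point of the async API: the job can be cancelled
    val n = Await.result(pending, 5.minutes)
    println(s"count = $n")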

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Ideally... we would like to copy, paste and try it on our end. A screenshot is not enough. If you have private information, just remove it and create a minimal example we can use to replicate the issue. I'd say similar to this : https://stackoverflow.com/help/mcve On Mon, 07 Jan 2019 04:15:16

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Sorry, the code is too long; to put it simply, look at the photo: I define an ArrayBuffer containing "1 2", "2 3", "4 5". I want to save it in HDFS, so I make it into an RDD with sc.

Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Please share a minimal amount of code so we can try to reproduce the issue... On Mon, 07 Jan 2019 00:46:42 -0800 fyyleej...@163.com wrote Hi all, In my experiment program I used Spark GraphX; when running in IDEA on Windows the result is right, but when running on the Linux distributed

Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Hi all, In my experiment program I used Spark GraphX. When running in IDEA on Windows the result is right, but when running on the Linux distributed cluster the result in HDFS is empty. Why, and how can I solve it?

Is there any window operation for RDDs in Pyspark? like for DStreams

2018-11-20 Thread zakhavan
Hello, I have two RDDs and my goal is to calculate the Pearson's correlation between them using sliding window. I want to have 200 samples in each window from rdd1 and rdd2 and calculate the correlation between them and then slide the window with 120 samples and calculate the correlation between

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-04 Thread zakhavan
Thank you. It helps. Zeinab -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
window operation on RDDs in Pyspark? Thank you, Taylor for your reply. The second solution doesn't work for my case since my text files are getting updated every second. Actually, my input data is live such that I'm getting 2 streams of data from 2 seismic sensors and then I write them into 2 text

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
and produce them as DStreams. The following code is how I'm getting the data and writing them into 2 text files. Do you have any idea how I can use Kafka in this case so that I have DStreams instead of RDDs? from obspy.clients.seedlink.easyseedlink import create_client from obspy import read import numpy

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
window operation on RDDs in Pyspark? Hello, I have 2 text file in the following form and my goal is to calculate the Pearson correlation between them using sliding window in pyspark: 123.00 -12.00 334.00 . . . First I read these 2 text file and store them in RDD format and then I apply

How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
") if CrossCorr >= 0.7: print("rdd1 & rdd2 are correlated") I know from the error that the window operation is only for DStreams, but since I have RDDs here, how can I do a window operation on RDDs? Thank you, Zeinab

[Spark Core] details of persisting RDDs

2018-03-23 Thread Stefano Pettini
Hi, couple of questions about the internals of the persist mechanism (RDD, but maybe applicable also to DS/DF). Data is processed stage by stage. So what actually runs in worker nodes is the calculation of the partitions of the result of a stage, not the single RDDs. Operation of all the RDDs

[SparkQL] how are RDDs partitioned and distributed in a standalone cluster?

2018-02-18 Thread prabhastechie
Say I have a main method with the following pseudo-code (to be run on a spark standalone cluster): main(args) { RDD rdd rdd1 = rdd.map(...) // some other statements not using RDD rdd2 = rdd.filter(...) } When executed, will each of the two statements involving RDDs (map and filter

Re: Union of RDDs Hung

2017-12-12 Thread Gerard Maas
Can you show us the code? On Tue, Dec 12, 2017 at 9:02 AM, Vikash Pareek <vikaspareek1...@gmail.com> wrote: > Hi All, > > I am unioning 2 rdds(each of them having 2 records) but this union it is > getting hang. > I found a solution to this that is caching both the rdds befo

Union of RDDs Hung

2017-12-12 Thread Vikash Pareek
Hi All, I am unioning 2 RDDs (each of them having 2 records) but this union is getting hung. I found a workaround, caching both RDDs before performing the union, but I could not figure out the root cause of the job hanging. Does somebody know why this happens with union? Spark

[spark-core] Choosing the correct number of partitions while joining two RDDs with partitioner set on one

2017-08-07 Thread Piyush Narang
hi folks, I was debugging a Spark job that was ending up with too few partitions during the join step and thought I'd reach out to understand whether this is the right behavior and what the typical workarounds are. I have two RDDs that I'm joining. One with a lot of partitions (5K+) and one with much lesser
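For context, a minimal sketch of the behavior and the usual workaround: when one input already has a partitioner, the join tends to reuse it (and its partition count), and passing an explicit Partitioner overrides that. Sizes and key types below are placeholders; assumes an existing SparkContext sc.

    import org.apache.spark.HashPartitioner

    val big   = sc.parallelize(1 to 1000000).map(i => (i % 100000, i))          // no partitioner set
    val small = sc.parallelize(1 to 1000).map(i => (i, i.toString))
      .partitionBy(new HashPartitioner(20))                                     // partitioner set here

    val defaultJoin  = big.join(small)                              // typically inherits small's 20 partitions
    val explicitJoin = big.join(small, new HashPartitioner(5000))   // an explicit partitioner wins
    println(s"${defaultJoin.getNumPartitions} vs ${explicitJoin.getNumPartitions}")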

Re: json in Cassandra to RDDs

2017-07-01 Thread ayan guha
, > > I'm using Cassandra with only 2 fields (id, json). > I'm using Spark to query the json. Until now I can use a json file and > query that file, but Cassandra and RDDs of the json field not yet. > > sc = spark.sparkContext > path = "/home/me/red50k.json&qu

json in Cassandra to RDDs

2017-07-01 Thread Conconscious
Hi list, I'm using Cassandra with only 2 fields (id, json). I'm using Spark to query the json. Until now I can use a json file and query that file, but Cassandra and RDDs of the json field not yet. sc = spark.sparkContext path = "/home/me/red50k.json" redirectsDF = spark.read

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread ayan guha
r 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianth...@outlook.com> > wrote: > > Hi, > > I want to perform some simple transformations and check the execution time, > under various configurations (e.g. number of cores being used, number of > partitions etc). Since it

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread khwunchai jaengsawang
ssible to set the partitions of a > dataframe , I guess that I should probably use RDDs. > > I've got a dataset with 3 columns as shown below: > > val data = file.map(line => line.split(" ")) > .filter(lines => lines.length == 3) // ignore first lin

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread bryan . jeffrey
ecution time, under various configurations (e.g. number of cores being used, number of partitions etc). Since it is not possible to set the partitions of a dataframe , I guess that I should probably use RDDs. I've got a dataset with 3 columns as shown below: val data = file.map

[RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread Old-School
Hi, I want to perform some simple transformations and check the execution time, under various configurations (e.g. number of cores being used, number of partitions etc). Since it is not possible to set the partitions of a dataframe , I guess that I should probably use RDDs. I've got a dataset
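A minimal sketch of that kind of experiment on the RDD side (the path, column meanings and partition counts are placeholders; assumes an existing SparkContext sc):

    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s")
      result
    }

    val data = sc.textFile("/data/input.txt")
      .map(_.split(" "))
      .filter(_.length == 3)          // keep only well-formed rows

    for (parts <- Seq(8, 32, 128)) {
      val repartitioned = data.repartition(parts)
      time(s"count with $parts partitions") { repartitioned.count() }
    }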

Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
r occurred while calling o36.sql. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 186, lxpbda25.ra1.intra.groupama.fr): org.apache.spark.Spa

Re: Sharing RDDS across applications and users

2016-10-28 Thread vincent gromakowski
1 PM, Victor Shafran <victor.shaf...@equalum.io> >> wrote: >> >> Hi Vincent, >> Can you elaborate on how to implement "shared sparkcontext and fair >> scheduling" option? >> >> My approach was to use sparkSession.getOrCreate() method and

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
mp table in one application. However, I was not able to access this > tempTable in another application. > You help is highly appreciated > Victor > > On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com> wrote: > >> Hi Mich, >> >> Yes, Alluxio is co

Re: Sharing RDDS across applications and users

2016-10-28 Thread Chanh Le
was not able to access this tempTable in > another application. > You help is highly appreciated > Victor > > On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com > <mailto:gene.p...@gmail.com>> wrote: > Hi Mich, > > Yes, Alluxio is commonly used

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
of using Zeppelin to share RDDs with many users. From > the notes on Zeppelin it appears that this is sharing UI and I am not sure > how easy it is going to be changing the result set with different users > modifying say sql queries. > > There is also the idea of caching RDDs with so

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
gmail.com> wrote: > >> Hi Mich, >> >> Yes, Alluxio is commonly used to cache and share Spark RDDs and >> DataFrames among different applications and contexts. The data typically >> stays in memory, but with Alluxio's tiered storage, the "colder"

Re: Sharing RDDS across applications and users

2016-10-27 Thread Victor Shafran
is highly appreciated Victor On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com> wrote: > Hi Mich, > > Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames > among different applications and contexts. The data typically stays in > memory, bu

Re: Sharing RDDS across applications and users

2016-10-27 Thread Gene Pang
Hi Mich, Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames among different applications and contexts. The data typically stays in memory, but with Alluxio's tiered storage, the "colder" data can be evicted out to other medium, like SSDs and HDDs. Here is a
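A minimal sketch of that sharing pattern (the Alluxio master host, port and paths are placeholders; assumes the Alluxio client is on the Spark classpath):

    // Application A writes its result RDD to Alluxio (resultRdd is whatever A computed):
    resultRdd.saveAsTextFile("alluxio://alluxio-master:19998/shared/result")

    // Application B, with its own SparkContext, reads it back:
    val shared = sc.textFile("alluxio://alluxio-master:19998/shared/result")
    println(shared.count())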

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
:59 GMT+02:00 vincent gromakowski < >> vincent.gromakow...@gmail.com>: >> >>> I would prefer sharing the spark context and using FAIR scheduler for >>> user concurrency >>> >>> Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh" <mich.t

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
; >> a écrit : >> >>> thanks Vince. >>> >>> So Ignite uses some hash/in-memory indexing. >>> >>> The question is in practice is there much use case to use these two >>> fabrics for sharing RDDs. >>> >>> Remember all RDBMSs do

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
:48 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a > écrit : > >> thanks Vince. >> >> So Ignite uses some hash/in-memory indexing. >> >> The question is in practice is there much use case to use these two >> fabrics for sharing RDDs. &g

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
e is there much use case to use these two > fabrics for sharing RDDs. > > Remember all RDBMSs do this through shared memory. > > In layman's term if I have two independent spark-submit running, can they > share result set. For example the same tempTable etc? > > Ch

Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
il.com> > wrote: > > Thanks Chanh, > > Can it share RDDs. > > Personally I have not used either Alluxio or Ignite. > > Are there major differences between these two > Have you tried Alluxio for sharing Spark RDDs and if so do you have any > experience you

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
thanks Vince. So Ignite uses some hash/in-memory indexing. The question is, in practice, is there much of a use case for these two fabrics for sharing RDDs? Remember, all RDBMSs do this through shared memory. In layman's terms, if I have two independent spark-submit jobs running, can they share a result set

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Ignite works only with Spark 1.5. Ignite leverages indexes. Alluxio provides tiering. Alluxio easily integrates with the underlying FS. Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a écrit : > Thanks Chanh, > > Can it share RDDs. > > Personally

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Chanh, Can it share RDDs? Personally I have not used either Alluxio or Ignite. 1. Are there major differences between these two? 2. Have you tried Alluxio for sharing Spark RDDs, and if so do you have any experience you can kindly share? Regards Dr Mich Talebzadeh LinkedIn

Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich, Alluxio is the good option to go. Regards, Chanh > On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com> > wrote: > > > There was a mention of using Zeppelin to share RDDs with many users. From the > notes on Zeppelin it appears that thi

Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
There was a mention of using Zeppelin to share RDDs with many users. From the notes on Zeppelin it appears that this shares the UI, and I am not sure how easy it is going to be to change the result set with different users modifying, say, SQL queries. There is also the idea of caching RDDs

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
= 1224.6 MB. Storage limit = 1397.3 MB. Therefore, I repartitioned the RDDs for better memory utilisation, which resolved the issue. Kind regards, Guru On 11 October 2016 at 11:23, diplomatic Guru <diplomaticg...@gmail.com> wrote: > @Song, I have called an action but it did not

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
@Song, I have called an action but it did not cache as you can see in the provided screenshot on my original email. It has cached to disk but not memory. @Chin Wei Low, I have 15GB of memory allocated, which is more than the dataset size. Any other suggestion please? Kind regards, Guru On 11

Re: [Spark] RDDs are not persisting in memory

2016-10-10 Thread Chin Wei Low
Hi, Your RDD is 5GB, perhaps it is too large to fit into executor's storage memory. You can refer to the Executors tab in Spark UI to check the available memory for storage for each of the executor. Regards, Chin Wei On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru

[Spark] RDDs are not persisting in memory

2016-10-10 Thread diplomatic Guru
Hello team, Spark version: 1.6.0 I'm trying to persist some data in memory for reuse. However, when I call rdd.cache() OR rdd.persist(StorageLevel.MEMORY_ONLY()) it does not store the data, as I cannot see any RDD information under the Web UI (Storage tab). Therefore I tried
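One thing to double-check, shown as a minimal sketch (path and transformation are placeholders; assumes an existing SparkContext sc): persist/cache is lazy, so nothing appears under the Storage tab until an action has materialized the RDD.

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("/data/big-input")
      .map(_.toUpperCase)
      .persist(StorageLevel.MEMORY_ONLY)   // or MEMORY_AND_DISK if it might not fit in memory

    rdd.count()                            // an action materializes and caches the partitions
    println(rdd.getStorageLevel)           // confirm the storage level programmatically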

Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
Thanks Daniel. Do you have any code fragments on using CoGroups or Joins across 2 RDDs ? I don't think that index would help much because this is an N x M operation, examining each cell of each RDD. Each comparison is complex as it needs to peer into a complex JSON On Mon, Aug 15, 2016 at 1:24
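Not from the original thread, but a minimal sketch of both patterns (the record type and the matches predicate are placeholders; assumes an existing SparkContext sc):

    case class Rec(id: String, json: String)
    def matches(a: Rec, b: Rec): Boolean = a.json.contains(b.id)   // stand-in for the real comparison

    val a = sc.parallelize(Seq(Rec("1", "{\"k\":1}"), Rec("2", "{\"k\":2}")))
    val b = sc.parallelize(Seq(Rec("2", "{\"k\":9}")))

    // Full N x M comparison when no join key exists:
    val allPairs = a.cartesian(b).filter { case (x, y) => matches(x, y) }

    // If a key can be derived, a join avoids the full cross product:
    val keyed = a.keyBy(_.id).join(b.keyBy(_.id)).values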

Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Daniel Imberman
or Join between the two RDDs. if index matters, you can use ZipWithIndex on both before you join and then see which indexes match up. On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <e...@analyticsmd.com> wrote: > I've nested foreach loops like this: > > for i in A[i] do: >

How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
I've nested foreach loops like this: for i in A[i] do: for j in B[j] do: append B[j] to some list if B[j] 'matches' A[i] in some fashion. Each element in A or B is some complex structure like: ( some complex JSON, some number ) Question: if A and B were represented as RDDs (e.g.

Re: how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Jörn Franke
s in Array B do: > > compare a[3] with b[4] see if they 'match' and if match, return that element; > > If I were to represent Arrays A and B as 2 separate RDDs, how would my code > look like ? > > I couldn't find any RDD functions that would do this for me efficientl

how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Eric Ho
Hi, I've two nested-for loops like this: *for all elements in Array A do:* *for all elements in Array B do:* *compare a[3] with b[4] see if they 'match' and if match, return that element;* If I were to represent Arrays A and B as 2 separate RDDs, how would my code look like ? I couldn't find

groupBy cannot handle large RDDs

2016-06-29 Thread Kaiyin Zhong
Could anyone have a look at this? It looks like a bug: http://stackoverflow.com/questions/38106554/groupby-cannot-handle-large-rdds Best regards, Kaiyin ZHONG

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
v, context).zipWithIndex.map { x => >>> (x._1, split.startIndex + x._2) >>> >>> You can modify the second component of the tuple to take data.length >>> into account. >>> >>> On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com>

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
to account. >> >> On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com> >> wrote: >> >>> Hi >>> >>> I wanted to change the functioning of the "zipWithIndex" function for >>> spark RDDs in which the

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com> > wrote: > >> Hi >> >> I wanted to change the functioning of the "zipWithIndex" function for >> spark RDDs in which the output of the function is, just for an example

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
can modify the second component of the tuple to take data.length into account. On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com> wrote: > Hi > > I wanted to change the functioning of the "zipWithIndex" function for > spark RDDs in which the

Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi I wanted to change the functioning of the "zipWithIndex" function for spark RDDs in which the output of the function is, just for an example, "(data, prev_index+data.length)" instead of "(data,prev_index+1)". How can I do this? -- Thank You Regards Punit Naik
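One reading of that requirement is a running offset that grows by each element's length rather than by 1. A minimal sketch of that reading (assumes an RDD[String] and an existing SparkContext sc):

    val data = sc.parallelize(Seq("foo", "quux", "ab"), 2)

    // One Long per partition, collected to the driver, then turned into per-partition start offsets.
    val perPartition = data.mapPartitions(it => Iterator(it.map(_.length.toLong).sum)).collect()
    val offsets = perPartition.scanLeft(0L)(_ + _)

    val indexed = data.mapPartitionsWithIndex { (pid, it) =>
      var running = offsets(pid)
      it.map { s =>
        running += s.length
        (s, running)        // e.g. ("foo", 3), ("quux", 7), ("ab", 9)
      }
    }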

Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
ch file is approx hundreds > of MBs to 2-3 gigs) into one big parquet file. > > I am loading each one of them and trying to take a union, however this leads > to enormous amounts of partitions, as union keeps on adding the partitions of > the input RDDs together. > > I also

Re: Union of multiple RDDs

2016-06-21 Thread Eugene Morozov
on, however this > leads to enormous amounts of partitions, as union keeps on adding the > partitions of the input RDDs together. > > I also tried loading all the files via wildcard, but that behaves almost > the same as union i.e. generates a lot of partitions. > > One of the appro

Union of multiple RDDs

2016-06-21 Thread Apurva Nandan
RDDs together. I also tried loading all the files via a wildcard, but that behaves almost the same as union, i.e. it generates a lot of partitions. One of the approaches I thought of was to repartition the RDD generated after each union and then continue the process, but I don't know how efficient
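A minimal sketch of the "single union, then collapse the partitions once" approach (paths and the target partition count are placeholders; assumes an existing SparkContext sc):

    val paths = Seq("/data/part1", "/data/part2", "/data/part3")
    val combined = sc.union(paths.map(p => sc.textFile(p)))   // partitions = sum of all inputs' partitions
    val compact = combined.coalesce(200)                      // no shuffle; use repartition(200) to rebalance instead
    compact.saveAsTextFile("/data/combined")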

Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
trics about each line (record type, line length, etc). >> Most are identical so I'm calling distinct(). >> >> In the loop over the list of files, I'm saving up the resulting RDDs into >> a List. After the loop, I use the JavaSparkContext union(JavaRDD... >> rdds) met

Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Eugene Morozov
type, line length, etc). > Most are identical so I'm calling distinct(). > > In the loop over the list of files, I'm saving up the resulting RDDs into > a List. After the loop, I use the JavaSparkContext union(JavaRDD... > rdds) method to collapse the tables into one. > > Like this -- &g

StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
, I'm saving up the resulting RDDs into a List. After the loop, I use the JavaSparkContext union(JavaRDD... rdds) method to collapse the tables into one. Like this -- List<JavaRDD> allMetrics = ... for (int i = 0; i < files.size(); i++) { JavaPairRDD<...> lines = jsc.n

Re: Using data frames to join separate RDDs in spark streaming

2016-06-05 Thread Cyril Scetbon
id") >> >>val df = rdd1.toDF("id", "aid") >> >>df.select(explode(df("aid")).as("aid"), df("id")) >> .join(df_aids, $"aid" === df_aids("id")) >>

Re: Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
val df_aids = rdd.toDF("id") > > val df = rdd1.toDF("id", "aid") > > df.select(explode(df("aid")).as("aid"), df("id")) >.join(df_aids, $"aid" === df_aids("id")) >

Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
") df.select(explode(df("aid")).as("aid"), df("id")) .join(df_aids, $"aid" === df_aids("id")) .select(df("id"), df_aids("id")) . } Is there a way to still use Dataframes to do i
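A minimal, runnable reconstruction of the pattern discussed in this thread (the RDD contents are placeholders; assumes a spark-shell style SparkSession named spark and SparkContext sc):

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    val rdd  = sc.parallelize(Seq("a1", "a2"))                                    // ids to match against
    val rdd1 = sc.parallelize(Seq(("u1", Seq("a1", "a3")), ("u2", Seq("a2"))))    // (id, aid array)

    val df_aids = rdd.toDF("id")
    val df      = rdd1.toDF("id", "aid")

    df.select(explode(df("aid")).as("aid"), df("id"))
      .join(df_aids, $"aid" === df_aids("id"))
      .select(df("id"), df_aids("id"))
      .show()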

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both RDDs together instead of doing a Cartesian. Say, generate a pair RDD from each with the first letter as the key. Then do a partition and a join. On May 25, 2016 8:04 PM, "Priya Ch" <learnings.chitt...@gmail.com> wrote: > Hi, > RDD A is
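A minimal sketch of that blocking idea (paths, the first-letter key and the similarity function are placeholders; assumes an existing SparkContext sc):

    def similarity(a: String, b: String): Double =             // stand-in for the real fuzzy score
      if (a.equalsIgnoreCase(b)) 1.0 else 0.0

    val rddA = sc.textFile("/data/fileA")
    val rddB = sc.textFile("/data/fileB")

    val matched = rddA.keyBy(s => s.headOption.getOrElse(' '))
      .join(rddB.keyBy(s => s.headOption.getOrElse(' ')))       // only pairs sharing a first letter
      .map { case (_, (a, b)) => (a, b, similarity(a, b)) }
      .filter { case (_, _, score) => score > 0.85 }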

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi, RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would like to filter out the strings that have a greater than 85% match and generate a score for them which is used in the subsequent calculations. I tried generating a pair RDD from both RDDs A and B with the same key for all

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
se match is > 85%. We trying to Fuzzy match logic. > > How can use map/reduce operations across 2 rdds ? > > Thanks, > Padma Ch > >> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> >> Alternatively depending on the e

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Why do I need to deploy Solr for text analytics... I have files placed in HDFS. I just need to look for matches against each string in both files and generate those records whose match is > 85%. We are trying fuzzy match logic. How can I use map/reduce operations across 2 RDDs? Thanks, Padma

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Alternatively depending on the exact use case you may employ solr on Hadoop for text analytics > On 25 May 2016, at 12:57, Priya Ch wrote: > > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of > strings as

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
No this is not needed, look at the map / reduce operations and the standard spark word count > On 25 May 2016, at 12:57, Priya Ch wrote: > > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of > strings as {"padma","hihi","chch","priya"}.

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
>>>>> The results of this query matches every row in the FinancialCodes >>>>> table with every row in the FinancialData table. Each row consists >>>>> of all columns from the FinancialCodes table followed by all columns from >>>>> the FinancialData ta

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
t; >>>>> SELECT * FROM FinancialCodes, FinancialData >>>>> >>>>> The results of this query matches every row in the FinancialCodes table >>>>> with every row in the FinancialData table. Each row consists of all >>>>> columns from the Financial

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
the FinancialCodes table >>>> with every row in the FinancialData table. Each row consists of all >>>> columns from the FinancialCodes table followed by all columns from the >>>> FinancialData table. >>>> >>>> >>>> Not very useful >>>> >

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
kedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> &g

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
s://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote: >> >>&g

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
; > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 25 May 2016 at 08

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Mich Talebzadeh
ss.com On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote: > Hi All, > > I have two RDDs A and B where in A is of size 30 MB and B is of size 7 > MB, A.cartesian(B) is taking too much time. Is there any bottleneck in > cartesian operation ? > > I am usin

Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi All, I have two RDDs A and B, where A is of size 30 MB and B is of size 7 MB, and A.cartesian(B) is taking too much time. Is there any bottleneck in the cartesian operation ? I am using spark 1.6.0 version Regards, Padma Ch

How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file, so that my function in Scala is able to return 2 different RDDs, with each RDD
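A minimal sketch of one way to do it, returning the two RDDs as a pair (the field positions, routing rule and case-class fields are placeholder assumptions; assumes an existing SparkContext sc):

    import org.apache.spark.rdd.RDD

    case class Case1(id: String, name: String)
    case class Case2(id: String, amount: Double)

    def splitRecords(path: String): (RDD[Case1], RDD[Case2]) = {
      val fields = sc.textFile(path).map(_.split(",", -1)).filter(_.length == 16)
      val rdd1 = fields.filter(_(0) == "1").map(f => Case1(f(1), f(2)))           // rows routed to Case1
      val rdd2 = fields.filter(_(0) == "2").map(f => Case2(f(1), f(3).toDouble))  // rows routed to Case2
      (rdd1, rdd2)
    }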

How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file, so that my function in Scala is able to return 2 different RDDs, with each RDD

Re: Confused - returning RDDs from functions

2016-05-13 Thread Dood
On 5/12/2016 10:01 PM, Holden Karau wrote: This is not the expected behavior, can you maybe post the code where you are running into this? Hello, thanks for replying! Below is the function I took out from the code. def converter(rdd:

Re: Confused - returning RDDs from functions

2016-05-12 Thread Holden Karau
This is not the expected behavior, can you maybe post the code where you are running into this? On Thursday, May 12, 2016, Dood@ODDO wrote: > Hello all, > > I have been programming for years but this has me baffled. > > I have an RDD[(String,Int)] that I return from a

Confused - returning RDDs from functions

2016-05-12 Thread Dood
Hello all, I have been programming for years but this has me baffled. I have an RDD[(String,Int)] that I return from a function after extensive manipulation of an initial RDD of a different type. When I return this RDD and initiate the .collectAsMap() on it from the caller, I get an empty

Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
On Sun, Apr 3, 2016 at 11:34 AM, Sergey <ser...@gmail.com> wrote: > Hi Spark ML experts! > > Do you use RDDs caching somewhere together with ML lib to speed up > calculation? > I mean typical machine learning use cases. > Train-test split, train, evaluate, apply model. > > Sergey. >

RDDs caching in typical machine learning use cases

2016-04-03 Thread Sergey
Hi Spark ML experts! Do you use RDDs caching somewhere together with ML lib to speed up calculation? I mean typical machine learning use cases. Train-test split, train, evaluate, apply model. Sergey.

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Thamme Gowda N.
uence file, but it >>> works for text file. >>> >>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com> >>> wrote: >>> >>>> Hi spark experts, >>>> >>>> I am facing issues with cached RDDs. I noti

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
t;> Looks like a spark bug. I can reproduce it for sequence file, but it >> works for text file. >> >> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com> >> wrote: >> >>> Hi spark experts, >>> >>> I am facing issues

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
ence file, but it works > for text file. > > On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com> > wrote: > >> Hi spark experts, >> >> I am facing issues with cached RDDs. I noticed that few entries >> get duplicated for n times when the RDD is cac

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Looks like a spark bug. I can reproduce it for sequence file, but it works for text file. On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com> wrote: > Hi spark experts, > > I am facing issues with cached RDDs. I noticed that few entries > get duplicated for n

[Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Thamme Gowda N.
Hi spark experts, I am facing issues with cached RDDs. I noticed that a few entries get duplicated n times when the RDD is cached. I asked a question on Stackoverflow with my code snippet to reproduce it. I would really appreciate it if you can visit http://stackoverflow.com/q/36168827/1506477
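For anyone hitting this later: Hadoop RecordReaders reuse the same Writable instance for every record, so caching the raw (key, value) pairs can show up as repeated entries. A minimal sketch of the usual fix, copying to immutable values before caching (the path and key/value types are examples):

    import org.apache.hadoop.io.{LongWritable, Text}

    val raw = sc.sequenceFile("/data/input.seq", classOf[LongWritable], classOf[Text])
    val safe = raw.map { case (k, v) => (k.get(), v.toString) }   // materialize real copies
    safe.cache()
    println(safe.count())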

Re: Can't zip RDDs with unequal numbers of partitions

2016-03-20 Thread Jakob Odersky
fter > changing parameter > > spark.sql.autoBroadcastJoinThreshold to 10 > > > Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal > numbers of partitions > at > org.apache.spark.rdd.ZippedPartitionsBaseRDD.

Re: Can't zip RDDs with unequal numbers of partitions

2016-03-19 Thread Jiří Syrový
hu, Mar 17, 2016 at 10:03 AM, Jiří Syrový <syrovy.j...@gmail.com> > wrote: > > Hi, > > > > any idea what could be causing this issue? It started appearing after > > changing parameter > > > > spark.sql.autoBroadcastJoinThreshold to 100000 > > >

Can't zip RDDs with unequal numbers of partitions

2016-03-18 Thread Jiří Syrový
Hi, any idea what could be causing this issue? It started appearing after changing parameter *spark.sql.autoBroadcastJoinThreshold to 10* Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :) On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Spark users and developers, > > anyone knows how to union two RDDs without the overhead of it? > > sa

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
05 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi Spark users and developers, >> >> anyone knows how to union two RDDs without the overhead of it? >> >> say rdd1.union(rdd2).saveTextFile(..) >> This requires a stage to union the 2 rdds before saveAs

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra
Agree with Koert that UnionRDD should have narrow dependencies. Although a union of two RDDs increases the number of tasks to be executed (rdd1.partitions + rdd2.partitions), if your two RDDs have the same number of partitions you can also use zipPartitions, which results in a smaller number of tasks
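A minimal sketch of the zipPartitions variant (both inputs must have the same number of partitions; the output path is a placeholder; assumes an existing SparkContext sc):

    val rdd1 = sc.parallelize(1 to 100, 4)
    val rdd2 = sc.parallelize(101 to 200, 4)

    // One task per partition pair instead of rdd1.partitions + rdd2.partitions tasks.
    val concatenated = rdd1.zipPartitions(rdd2) { (it1, it2) => it1 ++ it2 }
    concatenated.saveAsTextFile("/data/combined")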
