Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Artemis User
Unfortunately the answer you got from the forum is true.  The current
spark-rapids package doesn't support RDDs.  Please see
https://nvidia.github.io/spark-rapids/docs/FAQ.html#what-parts-of-apache-spark-are-accelerated


I guess one option for being able to use spark-rapids would be to
convert Hail to use the DataFrame API instead of RDDs.  Hope this helps...


-- ND

On 9/21/21 1:38 PM, Abhishek Shakya wrote:


Hi,

I am currently trying to run genomic analysis pipelines using
Hail (a library for genomic analysis written in Python and Scala).
Recently, Apache Spark 3 was released with support for GPU usage.


I tried the spark-rapids library to start an on-premises Slurm cluster with
GPU nodes. I was able to initialise the cluster. However, when I tried
running Hail tasks, the executors kept getting killed.


On asking in the Hail forum, I got the response that

That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any
Spark-SQL interfaces, only the RDD interfaces.

So, does Spark 3 not support GPU usage for the RDD interfaces?


PS: The question is posted on Stack Overflow as well: Link
<https://stackoverflow.com/questions/69273205/does-apache-spark-3-support-gpu-usage-for-spark-rdds>



Regards,
-

Abhishek Shakya
Senior Data Scientist 1,
Contact: +919002319890 | Email ID: abhishek.sha...@aganitha.ai 
<mailto:abhishek.sha...@aganitha.ai>

Aganitha Cognitive Solutions <https://aganitha.ai/>




Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Sean Owen
spark-rapids is not part of Spark, so I couldn't speak to it, but Spark
itself does not use GPUs at all.
It does let you configure a task to request a certain number of GPUs, and
that would work for RDDs, but it's up to the code being executed to use the
GPUs.
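
For reference, a minimal sketch of the Spark 3 resource scheduling Sean describes; the GPU counts and the discovery-script path are placeholders, and the task code itself still has to call a GPU library to get any acceleration:

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Ask Spark 3 to schedule GPUs for executors and tasks. Spark only tracks and
// assigns the addresses; it does not run RDD code on the GPU itself.
val spark = SparkSession.builder()
  .appName("gpu-aware-rdd")
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.task.resource.gpu.amount", "1")
  // Script that reports the GPUs visible on each worker (placeholder path).
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
  .getOrCreate()

val sc = spark.sparkContext

sc.parallelize(1 to 4, 4).foreachPartition { _ =>
  // Inside a task you can see which GPU address was assigned and hand it to
  // whatever GPU library the code actually uses.
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"partition ${TaskContext.getPartitionId()} got GPU(s): ${gpus.mkString(",")}")
}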

On Tue, Sep 21, 2021 at 1:23 PM Abhishek Shakya 
wrote:

>
> Hi,
>
> I am currently trying to run genomic analyses pipelines using Hail(library
> for genomics analyses written in python and Scala). Recently, Apache Spark
> 3 was released and it supported GPU usage.
>
> I tried spark-rapids library to start an on-premise slurm cluster with gpu
> nodes. I was able to initialise the cluster. However, when I tried running
> hail tasks, the executors kept getting killed.
>
> On querying in Hail forum, I got the response that
>
> That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any
> Spark-SQL interfaces, only the RDD interfaces.
> So, does Spark3 not support GPU usage for RDD interfaces?
>
>
> PS: The question is posted in stackoverflow as well: Link
> <https://stackoverflow.com/questions/69273205/does-apache-spark-3-support-gpu-usage-for-spark-rdds>
>
>
> Regards,
> -
>
> Abhishek Shakya
> Senior Data Scientist 1,
> Contact: +919002319890 | Email ID: abhishek.sha...@aganitha.ai
> Aganitha Cognitive Solutions <https://aganitha.ai/>
>


Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Abhishek Shakya
Hi,

I am currently trying to run genomic analysis pipelines using Hail (a library
for genomic analysis written in Python and Scala). Recently, Apache Spark 3
was released with support for GPU usage.

I tried the spark-rapids library to start an on-premises Slurm cluster with GPU
nodes. I was able to initialise the cluster. However, when I tried running
Hail tasks, the executors kept getting killed.

On asking in the Hail forum, I got the response that

That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any
Spark-SQL interfaces, only the RDD interfaces.

So, does Spark 3 not support GPU usage for the RDD interfaces?


PS: The question is posted on Stack Overflow as well: Link
<https://stackoverflow.com/questions/69273205/does-apache-spark-3-support-gpu-usage-for-spark-rdds>


Regards,
-

Abhishek Shakya
Senior Data Scientist 1,
Contact: +919002319890 | Email ID: abhishek.sha...@aganitha.ai
Aganitha Cognitive Solutions <https://aganitha.ai/>


Out of scope RDDs not getting cleaned up

2020-08-18 Thread jainbhavya53
Hi,

I am using Spark 2.1 and I am leveraging Spark Streaming for my data
pipeline. In my case the batch interval is 3 minutes, we persist a couple of
RDDs while processing a batch, and after processing we rely on Spark's
ContextCleaner to clean out RDDs that are no longer in scope.

So we have set "spark.cleaner.periodicGC.interval" = "15s" and
"spark.network.timeout" = "20s".

Now sometimes, when GC is triggered and it tries to clean all the
out-of-scope RDDs, a Futures timeout occurs (file attached -->
FuturesTimeout.txt) and it says "Failed to remove RDD 14254".

FuturesTImeoutException.txt
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t10974/FuturesTImeoutException.txt>
  

So when I go and check for this particular RDD under the Storage tab, I can
see that the same RDD is still there (verified using RddId -> 14254).

That would be fine if it were queued for cleanup in a subsequent GC cycle,
but that does not happen, and I can still see this RDD under the Storage tab.
This happened for a couple of more RDDs.

So I tried looking into it, and it seems that once ContextCleaner sends a
request for cleaning an RDD, if an error occurs while cleaning that RDD,
it does not re-queue the RDD for cleaning.

It seems like a bug to me. Can you please look into it?

Thanks
Bhavya Jain



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Async API to save RDDs?

2020-08-05 Thread Antonin Delpeuch (lists)
Hi,

The RDD API provides async variants of a few RDD methods, which let the
user execute the corresponding jobs asynchronously. This makes it
possible to cancel the jobs for instance:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/AsyncRDDActions.html
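
For illustration, a small sketch (not from the original mail) of how the existing async actions hand back a cancellable FutureAction, which is the kind of handle the question would like for the save methods too; the app name and data are placeholders:

import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("async-actions").setMaster("local[2]"))

// countAsync returns a FutureAction -- a handle on the submitted job.
val handle = sc.parallelize(1 to 1000000, 8).countAsync()

// The handle exposes the job ids and can be cancelled on user request...
println(s"submitted job(s): ${handle.jobIds.mkString(",")}")
// handle.cancel()

// ...or treated as an ordinary Scala Future.
handle.onComplete(result => println(s"count finished: $result"))

For actions that lack an async variant, one common workaround (not shown here) is to call them from a separate thread after SparkContext.setJobGroup, so the whole group can later be stopped with cancelJobGroup.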

There do not seem to be async versions of the save methods such as
`saveAsTextFile`:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#saveAsTextFile-java.lang.String-

Is there another way to start such jobs and get a handle on them (such
as the job id)? Specifically, I would like to be able to stop save jobs
on user request.

Thank you,
Antonin

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Ideally we would like to copy, paste and try it on our end. A screenshot is
not enough.

If you have private information, just remove it and create a minimal example
we can use to replicate the issue.
I'd suggest something similar to this:

https://stackoverflow.com/help/mcve

 On Mon, 07 Jan 2019 04:15:16 -0800 fyyleej...@163.com wrote 

Sorry, the code is too long; to put it simply (see the attached photo):

I define an ArrayBuffer containing "1 2", "2 3" and "4 5", and I want to
save it to HDFS, so I turn it into an RDD with
sc.parallelize(arrayBuffer)
but while a println of the values in IDEA shows the right data, when it runs
distributed there is nothing in the output.



-- 
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ 

- 
To unsubscribe e-mail: user-unsubscr...@spark.apache.org 



Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Sorry, the code is too long; to put it simply (see the attached photo):

I define an ArrayBuffer containing "1 2", "2 3" and "4 5", and I want to
save it to HDFS, so I turn it into an RDD with
sc.parallelize(arrayBuffer)
but while a println of the values in IDEA shows the right data, when it runs
distributed there is nothing in the output.
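
For reference, a minimal, self-contained sketch of the scenario described above (the HDFS path is a placeholder):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("save-arraybuffer"))

// The buffer described above: a few space-separated strings.
val buffer = ArrayBuffer("1 2", "2 3", "4 5")
val rdd = sc.parallelize(buffer)

// println inside RDD operations runs on the executors, so on a cluster the
// output lands in the executor logs, not the driver console.
rdd.foreach(println)

// Write to HDFS; the result is a directory of part files (placeholder path).
rdd.saveAsTextFile("hdfs:///user/jian/output")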



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Please share a minimal amount of code so we can try to reproduce the issue...

 On Mon, 07 Jan 2019 00:46:42 -0800 fyyleej...@163.com wrote 

Hi all,
In my experiment program I used Spark GraphX.
When running in IDEA on Windows, the result is right,
but when running on the Linux distributed cluster, the result in HDFS is
empty.
Why? How can I solve this?

Thanks!
Jian Li



-- 
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ 

- 
To unsubscribe e-mail: user-unsubscr...@spark.apache.org 



Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Hi all,
In my experiment program I used Spark GraphX.
When running in IDEA on Windows, the result is right,
but when running on the Linux distributed cluster, the result in HDFS is
empty.
Why? How can I solve this?

Thanks!
Jian Li



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is there any window operation for RDDs in Pyspark? like for DStreams

2018-11-20 Thread zakhavan
Hello,

I have two RDDs and my goal is to calculate the Pearson correlation
between them using a sliding window. I want to take 200 samples per window
from rdd1 and rdd2, calculate the correlation between them, then slide
the window by 120 samples and calculate the correlation for the next 200
samples. I know sliding windows work for DStreams, but I have to use
RDDs instead of DStreams. When I use the window function on an RDD I get an
error saying the RDD doesn't have a window attribute. The reasons I need a
window operation here are that 1) rdd1 and rdd2 are infinite streams and I
need to partition them into smaller chunks like windows, and 2) the built-in
Pearson correlation function in PySpark only works for partitions of equal
size, so in my case I chose 200 samples per window and 120 samples as the
sliding interval.
I'd appreciate it if you have any idea how to solve it.

My code is here:

import sys

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

if __name__ == "__main__":
    sc = SparkContext(appName="CorrelationsExample")
    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]
    num_of_partitions = 1

    # Read each file and turn every line into a float sample.
    rdd1 = sc.textFile(input_path1, num_of_partitions) \
        .flatMap(lambda line1: line1.strip().split("\n")) \
        .map(lambda strelem1: float(strelem1))
    rdd2 = sc.textFile(input_path2, num_of_partitions) \
        .flatMap(lambda line2: line2.strip().split("\n")) \
        .map(lambda strelem2: float(strelem2))

    l1 = rdd1.collect()
    l2 = rdd2.collect()
    seriesX = sc.parallelize(l1)
    seriesY = sc.parallelize(l2)
    print("Correlation is: " + str(Statistics.corr(seriesX, seriesY,
                                                   method="pearson")))
    sc.stop()



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-04 Thread zakhavan
Thank you. It helps.

Zeinab



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
Have a look at this guide here:
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

You should be able to send your sensor data to a Kafka topic, which Spark will 
subscribe to. You may need to use an Input DStream to connect Kafka to Spark.

https://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/#read-parallelism-in-spark-streaming
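
For reference, a minimal sketch (in Scala, matching the linked tutorial) of wiring a Kafka topic into an input DStream with the spark-streaming-kafka-0-10 integration; the broker address, group id and topic name are placeholders, and the sensor-specific parsing is left out:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(new SparkConf().setAppName("sensor-stream"), Seconds(1))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "seismic-consumers",                 // placeholder group id
  "auto.offset.reset" -> "latest")

// One DStream per sensor topic; each record's value is a raw sample line.
val sensor1 = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("sensor1"), kafkaParams))
  .map(record => record.value.toDouble)

sensor1.print()

ssc.start()
ssc.awaitTermination()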

Taylor

-Original Message-
From: zakhavan  
Sent: Tuesday, October 2, 2018 1:16 PM
To: user@spark.apache.org
Subject: RE: How to do sliding window operation on RDDs in Pyspark?

Thank you, Taylor for your reply. The second solution doesn't work for my case 
since my text files are getting updated every second. Actually, my input data 
is live such that I'm getting 2 streams of data from 2 seismic sensors and then 
I write them into 2 text files for simplicity and this is being done in 
real-time and text files get updated. But it seems I need to change my data 
collection method and store it as 2 DStreams. I know Kafka will work but I 
don't know how to do that because I will need to implement a custom Kafka 
consumer to consume the incoming data from the sensors and produce them as 
DStreams.

The following code is how I'm getting the data and write them into 2 text files.

Do you have any idea how I can use Kafka in this case so that I have DStreams 
instead of RDDs?

from obspy.clients.seedlink.easyseedlink import create_client
from obspy import read
import numpy as np
import obspy
from obspy import UTCDateTime


def handle_data(trace):
    print('Received new data:')
    print(trace)
    print()

    if trace.stats.network == "IU":
        trace.write("/home/zeinab/data1.mseed")
        st1 = obspy.read("/home/zeinab/data1.mseed")
        for i, el1 in enumerate(st1):
            f = open("%s_%d" % ("out_file1.txt", i), "a")
            f1 = open("%s_%d" % ("timestamp_file1.txt", i), "a")
            np.savetxt(f, el1.data, fmt="%f")
            np.savetxt(f1, el1.times("utcdatetime"), fmt="%s")
            f.close()
            f1.close()
    if trace.stats.network == "CU":
        trace.write("/home/zeinab/data2.mseed")
        st2 = obspy.read("/home/zeinab/data2.mseed")
        for j, el2 in enumerate(st2):
            ff = open("%s_%d" % ("out_file2.txt", j), "a")
            ff1 = open("%s_%d" % ("timestamp_file2.txt", j), "a")
            np.savetxt(ff, el2.data, fmt="%f")
            np.savetxt(ff1, el2.times("utcdatetime"), fmt="%s")
            ff.close()
            ff1.close()


client = create_client('rtserve.iris.washington.edu:18000', handle_data)
client.select_stream('IU', 'ANMO', 'BHZ')
client.select_stream('CU', 'ANWB', 'BHZ')
client.run()

Thank you,

Zeinab



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
Thank you, Taylor, for your reply. The second solution doesn't work for my
case since my text files are getting updated every second. My input data is
live: I'm getting two streams of data from two seismic sensors and writing
them into two text files for simplicity, and this is done in real time, so
the text files keep getting updated. It seems I need to change my data
collection method and store the data as two DStreams. I know Kafka will
work, but I don't know how to do that, because I will need to implement a
custom Kafka consumer to consume the incoming data from the sensors and
produce it as DStreams.

The following code is how I'm getting the data and write them into 2 text
files.

Do you have any idea how I can use Kafka in this case so that I have
DStreams instead of RDDs?

from obspy.clients.seedlink.easyseedlink import create_client
from obspy import read
import numpy as np
import obspy
from obspy import UTCDateTime


def handle_data(trace):
    print('Received new data:')
    print(trace)
    print()

    if trace.stats.network == "IU":
        trace.write("/home/zeinab/data1.mseed")
        st1 = obspy.read("/home/zeinab/data1.mseed")
        for i, el1 in enumerate(st1):
            f = open("%s_%d" % ("out_file1.txt", i), "a")
            f1 = open("%s_%d" % ("timestamp_file1.txt", i), "a")
            np.savetxt(f, el1.data, fmt="%f")
            np.savetxt(f1, el1.times("utcdatetime"), fmt="%s")
            f.close()
            f1.close()
    if trace.stats.network == "CU":
        trace.write("/home/zeinab/data2.mseed")
        st2 = obspy.read("/home/zeinab/data2.mseed")
        for j, el2 in enumerate(st2):
            ff = open("%s_%d" % ("out_file2.txt", j), "a")
            ff1 = open("%s_%d" % ("timestamp_file2.txt", j), "a")
            np.savetxt(ff, el2.data, fmt="%f")
            np.savetxt(ff1, el2.times("utcdatetime"), fmt="%s")
            ff.close()
            ff1.close()







client = create_client('rtserve.iris.washington.edu:18000', handle_data)
client.select_stream('IU', 'ANMO', 'BHZ')
client.select_stream('CU', 'ANWB', 'BHZ')
client.run()

Thank you,

Zeinab



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
Hey Zeinab,

We may have to take a small step back here. The sliding window approach (i.e.
the window operation) is unique to data-stream mining, so it makes sense that
window() is restricted to DStreams.

It looks like you're not using a stream mining approach. From what I can see in 
your code, the files are being read in, and you are using the window() 
operation after you have all the information.

Here's what can solve your problem:
1) Read the inputs into two DStreams and use window() as needed (a small
sketch follows below), or
2) You can always take a range of inputs from a spark RDD. Perhaps this will 
set you in the right direction:
https://stackoverflow.com/questions/24677180/how-do-i-select-a-range-of-elements-in-spark-rdd
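
As a tiny sketch of option 1 (Scala shown for brevity; PySpark's StreamingContext offers the same window(windowLength, slideInterval) operation, and the directories are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("windowed-correlation"), Seconds(1))

// Each new file dropped into these directories becomes part of the stream.
val stream1 = ssc.textFileStream("hdfs:///data/sensor1").map(_.toDouble)
val stream2 = ssc.textFileStream("hdfs:///data/sensor2").map(_.toDouble)

// window(windowLength, slideInterval) is a DStream operation (not an RDD one):
// here, 3-second windows that slide every 2 seconds.
val windowed1 = stream1.window(Seconds(3), Seconds(2))
val windowed2 = stream2.window(Seconds(3), Seconds(2))

windowed1.print()
windowed2.print()

ssc.start()
ssc.awaitTermination()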

Let me know if this helps your issue,

Taylor

-Original Message-
From: zakhavan  
Sent: Tuesday, October 2, 2018 9:30 AM
To: user@spark.apache.org
Subject: How to do sliding window operation on RDDs in Pyspark?

Hello,

I have 2 text file in the following form and my goal is to calculate the 
Pearson correlation between them using sliding window in pyspark:

123.00
-12.00
334.00
.
.
.

First I read these 2 text files and store them as RDDs, and then I apply the
window operation on each RDD, but I keep getting this error:

AttributeError: 'PipelinedRDD' object has no attribute 'window'

Here is my code:

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
    # DEFINE your input path
    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]

    num_of_partitions = 4
    rdd1 = spark.sparkContext.textFile(input_path1, num_of_partitions) \
        .flatMap(lambda line1: line1.strip().split("\n")) \
        .map(lambda strelem1: float(strelem1))
    rdd2 = spark.sparkContext.textFile(input_path2, num_of_partitions) \
        .flatMap(lambda line2: line2.strip().split("\n")) \
        .map(lambda strelem2: float(strelem2))

    # Windowing
    windowedrdd1 = rdd1.window(3, 2)
    windowedrdd2 = rdd2.window(3, 2)

    # Correlation between sliding windows
    CrossCorr = Statistics.corr(windowedrdd1, windowedrdd2,
                                method="pearson")

    if CrossCorr >= 0.7:
        print("rdd1 & rdd2 are correlated")

I know from the error that the window operation is only for DStreams, but
since I have RDDs here, how can I do a window operation on RDDs?

Thank you,

Zeinab





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
Hello,

I have 2 text files in the following form, and my goal is to calculate the
Pearson correlation between them using a sliding window in PySpark:

123.00
-12.00
334.00
.
.
.

First I read these 2 text files and store them as RDDs, and then I apply the
window operation on each RDD, but I keep getting this error:

AttributeError: 'PipelinedRDD' object has no attribute 'window'

Here is my code:

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
    # DEFINE your input path
    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]

    num_of_partitions = 4
    rdd1 = spark.sparkContext.textFile(input_path1, num_of_partitions) \
        .flatMap(lambda line1: line1.strip().split("\n")) \
        .map(lambda strelem1: float(strelem1))
    rdd2 = spark.sparkContext.textFile(input_path2, num_of_partitions) \
        .flatMap(lambda line2: line2.strip().split("\n")) \
        .map(lambda strelem2: float(strelem2))

    # Windowing
    windowedrdd1 = rdd1.window(3, 2)
    windowedrdd2 = rdd2.window(3, 2)

    # Correlation between sliding windows
    CrossCorr = Statistics.corr(windowedrdd1, windowedrdd2,
                                method="pearson")

    if CrossCorr >= 0.7:
        print("rdd1 & rdd2 are correlated")

I know from the error that the window operation is only for DStreams, but
since I have RDDs here, how can I do a window operation on RDDs?

Thank you,

Zeinab





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark Core] details of persisting RDDs

2018-03-23 Thread Stefano Pettini
Hi,

couple of questions about the internals of the persist mechanism (RDD, but
maybe applicable also to DS/DF).

Data is processed stage by stage, so what actually runs on worker nodes is
the calculation of the partitions of the result of a stage, not of the single
RDDs. The operations of all the RDDs that form a stage run together. That, at
least, is how I interpret the UI and the logs.

Then, what does "persisting an RDD" that is in the middle of a stage
actually mean? Let's say the result of a map, that is located before
another map, located before a reduce. Persisting A that is inside the stage
A -> B -> C.

Also the hint "persist an RDD if it's used more than once and you don't
want it to be calculated twice" is not precise. For example, if inside a
stage we have:

A -> B -> C -> E -> F
      |        ^
      +--> D --+

So basically a diamond, where B is used twice, as input of C and D, but
then the workflow re-joins in E, all inside the same stage, no shuffling. I
tested and B is not calculated twice. And again the original question: what
does actually happen when B is marked to be persisted?

Regards,
Stefano


[SparkQL] how are RDDs partitioned and distributed in a standalone cluster?

2018-02-18 Thread prabhastechie
Say I have a main method with the following pseudo-code (to be run on a spark
standalone cluster):
main(args) {
  RDD rdd
  rdd1 = rdd.map(...)
  // some other statements not using RDD
  rdd2 = rdd.filter(...)
}

When executed, will each of the two statements involving RDDs (map and
filter) be individually partitioned and distributed on available cluster
nodes? And any statements not involving RDDs (or data frames) will typically
be executed on the driver?
Is that how Spark takes advantage of the cluster?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Union of RDDs Hung

2017-12-12 Thread Gerard Maas
Can you show us the code?

On Tue, Dec 12, 2017 at 9:02 AM, Vikash Pareek <vikaspareek1...@gmail.com>
wrote:

> Hi All,
>
> I am unioning 2 rdds(each of them having 2 records) but this union it is
> getting hang.
> I found a solution to this that is caching both the rdds before performing
> union but I could not figure out the root cause of hanging the job.
>
> Is somebody knows why this happens with union?
>
> Spark version I am using is 1.6.1
>
>
> Best Regards,
> Vikash Pareek
>
>
>
> -
>
> __Vikash Pareek
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Union of RDDs Hung

2017-12-12 Thread Vikash Pareek
Hi All,

I am unioning 2 RDDs (each of them having 2 records), but this union is
hanging.
I found a workaround, which is caching both RDDs before performing the
union, but I could not figure out the root cause of the job hanging.

Does somebody know why this happens with union?

Spark version I am using is 1.6.1


Best Regards,
Vikash Pareek



-

__Vikash Pareek
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[spark-core] Choosing the correct number of partitions while joining two RDDs with partitioner set on one

2017-08-07 Thread Piyush Narang
hi folks,

I was debugging a Spark job that ended up with too few partitions during
the join step and thought I'd reach out to understand whether this is the
expected behavior and what the typical workarounds are.

I have two RDDs that I'm joining. One with a lot of partitions (5K+) and
one with much lesser partitions (< 50). I perform a reduceByKey on the
smallerRDD and then join the two together. I notice that the join
operations ends up with numPartitions = smallerRDD.numPartitions. This
seems to stem from the code in Partitioner.defaultPartitioner
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L57>.
That code checks if either of the RDDs has a partitioner specified and if
it does, it picks the partitioner and numPartitions of that RDD. In my case
as I'm calling reduceByKey on the smaller RDD, that ends up with a
partitioner being set and thus that's what we end up with along with the
much fewer number of partitions.

I'm currently just specifying the number of partitions I want, but I was
wondering if others have run into this and if there are other suggested
workarounds? To partition my larger RDD as well? Would it make sense in the
defaultPartitioner function to account for if the number of partitions is
much larger in one RDD?

Here's a simple snippet that illustrates things:

val largeRDD = sc.parallelize(
  List((1, 10), (1, 11), (2, 20), (2, 21), (3, 30), (3, 31)), 100)
val smallRDD = sc.parallelize(
  List((1, "one"), (2, "two"), (3, "three")), 2).reduceByKey((l, _) => l)

// ends up with a join with 2 partitions
largeRDD.join(smallRDD).collect().foreach(println)
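
For completeness, a hedged sketch of the workaround mentioned above: pass the partition count to join explicitly so it does not inherit the smaller RDD's partitioner.

// Ask the join for the larger side's partition count instead of letting
// Partitioner.defaultPartitioner pick the smaller, already-partitioned RDD.
largeRDD.join(smallRDD, largeRDD.partitions.length)
  .collect()
  .foreach(println)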


Thanks,

-- 
- Piyush


Re: json in Cassandra to RDDs

2017-07-01 Thread ayan guha
Hi

If you are asking how to parse the json column from Cassandra, I would
suggest you look into the from_json function. It will help you parse a
json field, given you know the schema upfront.
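
A small sketch of that approach, assuming the table is loaded through the DataFrame-based spark-cassandra-connector and that the JSON schema is known; the keyspace/table names come from the question and the schema field is a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("cassandra-json").getOrCreate()
import spark.implicits._

// Load the (id, json) table through the DataFrame-based Cassandra connector.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "ttable"))
  .load()

// Schema of the JSON column -- placeholder field, replace with the real ones.
val schema = StructType(Seq(StructField("idv", StringType)))

// Parse the json string column into a struct and query it with Spark SQL.
val parsed = df.withColumn("parsed", from_json($"json", schema))

parsed.select($"id", $"parsed.idv")
  .where($"parsed.idv" === "9")
  .show()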

On Sat, Jul 1, 2017 at 8:54 PM, Conconscious <conconsci...@gmail.com> wrote:

> Hi list,
>
> I'm using Cassandra with only 2 fields (id, json).
> I'm using Spark to query the json. Until now I can use a json file and
> query that file, but Cassandra and RDDs of the json field not yet.
>
> sc = spark.sparkContext
> path = "/home/me/red50k.json"
> redirectsDF = spark.read.json(path)
> redirectsDF.createOrReplaceTempView("red")
> result = spark.sql("select idv from red where idv = '9'")
> result.show()
>
> val conf = new SparkConf(true)
> .set("spark.cassandra.connection.host", "192.168.1.74")
> .set("spark.cassandra.auth.username", "cassandra")
> .set("spark.cassandra.auth.password", "cassandra")
> val sc = new SparkContext("spark://192.168.1.74:7077", "test", conf)
> val table = sc.cassandraTable("test", "ttable")
> println(ttable.count)
>
> Some help please to join both things. Scala or Python code for me it's ok.
> Thanks in advance.
> Cheers.
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


json in Cassandra to RDDs

2017-07-01 Thread Conconscious
Hi list,

I'm using Cassandra with only 2 fields (id, json).
I'm using Spark to query the json. So far I can read a json file and
query it, but not yet Cassandra and RDDs of the json field.

sc = spark.sparkContext
path = "/home/me/red50k.json"
redirectsDF = spark.read.json(path)
redirectsDF.createOrReplaceTempView("red")
result = spark.sql("select idv from red where idv = '9'")
result.show()

val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "192.168.1.74")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext("spark://192.168.1.74:7077", "test", conf)
val table = sc.cassandraTable("test", "ttable")
println(table.count)

Some help please to join both things. Scala or Python code for me it's ok.
Thanks in advance.
Cheers.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread ayan guha
Just as a best practice, DataFrames and Datasets are the preferred way, so try
not to resort to RDDs unless you absolutely have to...

On Sun, 5 Mar 2017 at 7:10 pm, khwunchai jaengsawang <khwuncha...@ku.th>
wrote:

> Hi Old-Scool,
>
>
> For the first question, you can specify the number of partition in any
> DataFrame by using
> repartition(numPartitions: Int, partitionExprs: Column*).
> *Example*:
> val partitioned = data.repartition(numPartitions=10).cache()
>
> For your second question, you can transform your RDD into PairRDD and use
> reduceByKey()
> *Example:*
> val pairs = data.map(row => (row(1), row(2)).reduceByKey(_+_)
>
>
> Best,
>
>   Khwunchai Jaengsawang
>   *Email*: khwuncha...@ku.th
>   LinkedIn <https://linkedin.com/in/khwunchai> | Github
> <https://github.com/khwunchai>
>
>
> On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianth...@outlook.com>
> wrote:
>
> Hi,
>
> I want to perform some simple transformations and check the execution time,
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> dataframe , I guess that I should probably use RDDs.
>
> I've got a dataset with 3 columns as shown below:
>
> val data = file.map(line => line.split(" "))
>  .filter(lines => lines.length == 3) // ignore first line
>  .map(row => (row(0), row(1), row(2)))
>  .toDF("ID", "word-ID", "count")
> results in:
>
> +--++-+
> | ID |  word-ID   |  count   |
> +--++-+
> |  15   |87  |   151|
> |  20   |19  |   398|
> |  15   |19  |   21  |
> |  180 |90  |   190|
> +---+-+
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10?
>
> Furthermore, I would also like to ask about the equivalent expression
> (using
> RDD API) for the following simple transformation:
>
> data.select("word-ID",
> "count").groupBy("word-ID").agg(sum($"count").as("count")).show()
>
>
>
> Thanks in advance
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
> --
Best Regards,
Ayan Guha


Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread khwunchai jaengsawang
Hi Old-Scool,


For the first question, you can specify the number of partitions of any
DataFrame by using repartition(numPartitions: Int, partitionExprs: Column*).
Example:
val partitioned = data.repartition(numPartitions = 10).cache()

For your second question, you can transform your RDD into a pair RDD and use
reduceByKey().
Example:
val pairs = data.rdd.map(row => (row.getString(1), row.getString(2).toInt)).reduceByKey(_ + _)


Best,

  Khwunchai Jaengsawang
  Email: khwuncha...@ku.th
  LinkedIn <https://linkedin.com/in/khwunchai> | Github 
<https://github.com/khwunchai>


> On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianth...@outlook.com> 
> wrote:
> 
> Hi,
> 
> I want to perform some simple transformations and check the execution time,
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> dataframe , I guess that I should probably use RDDs. 
> 
> I've got a dataset with 3 columns as shown below:
> 
> val data = file.map(line => line.split(" "))
>  .filter(lines => lines.length == 3) // ignore first line
>  .map(row => (row(0), row(1), row(2)))
>  .toDF("ID", "word-ID", "count")
> results in:
> 
> +--++-+
> | ID |  word-ID   |  count   |
> +--++-+
> |  15   |87  |   151|
> |  20   |19  |   398|
> |  15   |19  |   21  |
> |  180 |90  |   190|
> +---+-+
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10? 
> 
> Furthermore, I would also like to ask about the equivalent expression (using
> RDD API) for the following simple transformation:
> 
> data.select("word-ID",
> "count").groupBy("word-ID").agg(sum($"count").as("count")).show()
> 
> 
> 
> Thanks in advance
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread bryan . jeffrey


Rdd operation:


rdd.map(x => (word, count)).reduceByKey(_+_)
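
A slightly expanded sketch of that one-liner, assuming the three-column DataFrame ("ID", "word-ID", "count") built in the question, where all three columns are strings:

// Drop from the DataFrame to the RDD API and control partitioning explicitly.
val rdd = data.rdd.repartition(10)

// RDD equivalent of:
//   data.select("word-ID", "count").groupBy("word-ID").agg(sum($"count"))
val wordCounts = rdd
  .map(row => (row.getAs[String]("word-ID"), row.getAs[String]("count").toInt))
  .reduceByKey(_ + _)

wordCounts.take(20).foreach(println)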






Get Outlook for Android


On Sat, Mar 4, 2017 at 8:59 AM -0500, "Old-School"
<giorgos_myrianth...@outlook.com> wrote:

Hi,

I want to perform some simple transformations and check the execution time,
under various configurations (e.g. number of cores being used, number of
partitions etc). Since it is not possible to set the partitions of a
dataframe , I guess that I should probably use RDDs. 

I've got a dataset with 3 columns as shown below:

val data = file.map(line => line.split(" "))
  .filter(lines => lines.length == 3) // ignore first line
  .map(row => (row(0), row(1), row(2)))
  .toDF("ID", "word-ID", "count")
results in:

+-----+---------+-------+
| ID  | word-ID | count |
+-----+---------+-------+
| 15  | 87      | 151   |
| 20  | 19      | 398   |
| 15  | 19      | 21    |
| 180 | 90      | 190   |
+-----+---------+-------+
So how can I turn the above into an RDD in order to use e.g.
sc.parallelize(data, 10) and set the number of partitions to say 10? 

Furthermore, I would also like to ask about the equivalent expression (using
RDD API) for the following simple transformation:

data.select("word-ID",
"count").groupBy("word-ID").agg(sum($"count").as("count")).show()



Thanks in advance



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org








[RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread Old-School
Hi,

I want to perform some simple transformations and check the execution time,
under various configurations (e.g. number of cores being used, number of
partitions etc). Since it is not possible to set the partitions of a
dataframe , I guess that I should probably use RDDs. 

I've got a dataset with 3 columns as shown below:

val data = file.map(line => line.split(" "))
  .filter(lines => lines.length == 3) // ignore first line
  .map(row => (row(0), row(1), row(2)))
  .toDF("ID", "word-ID", "count")
results in:

+-----+---------+-------+
| ID  | word-ID | count |
+-----+---------+-------+
| 15  | 87      | 151   |
| 20  | 19      | 398   |
| 15  | 19      | 21    |
| 180 | 90      | 190   |
+-----+---------+-------+
So how can I turn the above into an RDD in order to use e.g.
sc.parallelize(data, 10) and set the number of partitions to say 10? 

Furthermore, I would also like to ask about the equivalent expression (using
RDD API) for the following simple transformation:

data.select("word-ID",
"count").groupBy("word-ID").agg(sum($"count").as("count")).show()



Thanks in advance



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
I think limit repartitions your data into a single partition if called as a
non-terminal operator. Hence zip works after limit, because you then have only
one partition.

In practice, I have found joins to be much more applicable than zip because of 
the strict limitation of identical partitions.

https://richardstartin.com

On 20 Dec 2016, at 16:04, Jack Wenger <jack.wenge...@gmail.com> wrote:

Hello,

I'm facing a strange behaviour with Spark 1.5.0 (Cloudera 5.5.1).
I'm loading data from Hive with HiveContext (~42M records) and then trying to
add a new column with "withColumn" and a UDF.
Finally I'm supposed to create a new Hive table from this dataframe.


Here is the code :



DATETIME_TO_COMPARE = "-12-31 23:59:59.99"

myFunction = udf(lambda col: 0 if col != DATETIME_TO_COMPARE else 1, 
IntegerType())

df1 = hc.sql("SELECT col1, col2, col3,col4,col5,col6,col7 FROM myTable WHERE 
col4 == someValue")

df2 = df1.withColumn("myNewCol", myFunction(df1.col3))
df2.registerTempTable("df2")

hc.sql("create table my_db.new_table as select * from df2")



But I get this error :


py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 
186, lxpbda25.ra1.intra.groupama.fr): org.apache.spark.SparkException: Can only zip RDDs with same number of elements
org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition
at 
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.hasNext(RDD.scala:832)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:104)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




What is surprising is that if I modify the select statement by adding a LIMIT
1 (which is more than twice the number of records in my table), then
it's working:


df1 = hc.sql("SELECT col1, col2, col3,col4,col5,col6,col7 FROM myTable WHERE
col4 == someValue LIMIT 1")


In both cases, if I run a count() on df1, I'm getting the same number : 42 593 
052

Is it a bug or am I missing something ?
If it is not a bug, what am I doing wrong ?


Thank you !


Jack


Re: Sharing RDDS across applications and users

2016-10-28 Thread vincent gromakowski
Bad idea: no caching, cluster over-consumption...
Have a look at instantiating a custom Thrift server on temp tables with the
fair scheduler to allow concurrent SQL requests. It's not a public API, but
you can find some examples.

Le 28 oct. 2016 11:12 AM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a
écrit :

> Hi,
>
> I think tempTable is private to the session that creates it. In Hive temp
> tables created by "CREATE TEMPORARY TABLE" are all private to the session.
> Spark is no different.
>
> The alternative may be everyone creates tempTable from the same DF?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 10:03, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Can you elaborate on how to implement "shared sparkcontext and fair
>> scheduling" option?
>>
>>
>> It just reuse 1 Spark Context by not letting it stop when the application
>> had done. Should check: livy, spark-jobserver
>> FAIR https://spark.apache.org/docs/1.2.0/job-scheduling.html just how
>> you scheduler your job in the pool but FAIR help you run job in parallel vs
>> FIFO (default) 1 job at the time.
>>
>>
>> My approach was to use  sparkSession.getOrCreate() method and register
>> temp table in one application. However, I was not able to access this
>> tempTable in another application.
>>
>>
>> Store metadata in Hive may help but I am not sure about this.
>> I use Spark Thrift Server create table on that then let Zeppelin query
>> from that.
>>
>> Regards,
>> Chanh
>>
>>
>>
>>
>>
>> On Oct 27, 2016, at 9:01 PM, Victor Shafran <victor.shaf...@equalum.io>
>> wrote:
>>
>> Hi Vincent,
>> Can you elaborate on how to implement "shared sparkcontext and fair
>> scheduling" option?
>>
>> My approach was to use  sparkSession.getOrCreate() method and register
>> temp table in one application. However, I was not able to access this
>> tempTable in another application.
>> You help is highly appreciated
>> Victor
>>
>> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>>> DataFrames among different applications and contexts. The data typically
>>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>>> om/blog/effective-spark-rdds-with-alluxio
>>>
>>> Also, Alluxio also has the concept of an "Under filesystem", which can
>>> help you access your existing data across different storage systems. Here
>>> is more information about the unified namespace abilities:
>>> http://www.alluxio.org/docs/master/en/Unified-and
>>> -Transparent-Namespace.html
>>>
>>> Hope that helps,
>>> Gene
>>>
>>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks Chanh,
>>>>
>>>> Can it share RDDs.
>>>>
>>>> Personally I have not used either Alluxio or Ignite.
>>>>
>>>>
>>>>1. Are there major differences between these two
>>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>>have any experience you can kindly share
>>>>
>>>> Regards
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disc

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Hi,

I think tempTable is private to the session that creates it. In Hive temp
tables created by "CREATE TEMPORARY TABLE" are all private to the session.
Spark is no different.

The alternative may be that everyone creates the tempTable from the same DF?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 10:03, Chanh Le <giaosu...@gmail.com> wrote:

> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
>
> It just reuse 1 Spark Context by not letting it stop when the application
> had done. Should check: livy, spark-jobserver
> FAIR https://spark.apache.org/docs/1.2.0/job-scheduling.html just how you
> scheduler your job in the pool but FAIR help you run job in parallel vs
> FIFO (default) 1 job at the time.
>
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
>
>
> Store metadata in Hive may help but I am not sure about this.
> I use Spark Thrift Server create table on that then let Zeppelin query
> from that.
>
> Regards,
> Chanh
>
>
>
>
>
> On Oct 27, 2016, at 9:01 PM, Victor Shafran <victor.shaf...@equalum.io>
> wrote:
>
> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
> You help is highly appreciated
> Victor
>
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>> DataFrames among different applications and contexts. The data typically
>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>> om/blog/effective-spark-rdds-with-alluxio
>>
>> Also, Alluxio also has the concept of an "Under filesystem", which can
>> help you access your existing data across different storage systems. Here
>> is more information about the unified namespace abilities:
>> http://www.alluxio.org/docs/master/en/Unified-and
>> -Transparent-Namespace.html
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>> Alluxio is the good option to go.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>
&g

Re: Sharing RDDS across applications and users

2016-10-28 Thread Chanh Le
> Can you elaborate on how to implement "shared sparkcontext and fair 
> scheduling" option? 


It just reuses one Spark context by not letting it stop when the application is
done. You should check livy and spark-jobserver.
FAIR (https://spark.apache.org/docs/1.2.0/job-scheduling.html) is just how you
schedule your jobs in the pool, but FAIR helps you run jobs in parallel versus
FIFO (the default), which runs one job at a time.


> My approach was to use  sparkSession.getOrCreate() method and register temp 
> table in one application. However, I was not able to access this tempTable in 
> another application. 


Storing metadata in Hive may help, but I am not sure about this.
I use Spark Thrift Server to create tables on it and then let Zeppelin query
them.

Regards,
Chanh





> On Oct 27, 2016, at 9:01 PM, Victor Shafran <victor.shaf...@equalum.io> wrote:
> 
> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair 
> scheduling" option? 
> 
> My approach was to use  sparkSession.getOrCreate() method and register temp 
> table in one application. However, I was not able to access this tempTable in 
> another application. 
> You help is highly appreciated 
> Victor
> 
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com 
> <mailto:gene.p...@gmail.com>> wrote:
> Hi Mich,
> 
> Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames 
> among different applications and contexts. The data typically stays in 
> memory, but with Alluxio's tiered storage, the "colder" data can be evicted 
> out to other medium, like SSDs and HDDs. Here is a blog post discussing Spark 
> RDDs and Alluxio: 
> https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio 
> <https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio>
> 
> Also, Alluxio also has the concept of an "Under filesystem", which can help 
> you access your existing data across different storage systems. Here is more 
> information about the unified namespace abilities: 
> http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html 
> <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html>
> 
> Hope that helps,
> Gene
> 
> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> Thanks Chanh,
> 
> Can it share RDDs.
> 
> Personally I have not used either Alluxio or Ignite.
> 
> Are there major differences between these two
> Have you tried Alluxio for sharing Spark RDDs and if so do you have any 
> experience you can kindly share
> Regards
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Alluxio is the good option to go. 
> 
> Regards,
> Chanh
> 
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> 
>> There was a mention of using Zeppelin to share RDDs with many users. From 
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure 
>> how easy it is going to be changing the result set with different users 
>> modifying say sql queries.
>> 
>> There is also the idea of caching RDDs with something like Apache Ignite. 
>> Has anyone really tried this. Will that work with multiple applications?
>> 
>> It looks feasible as RDDs are immutable and so are registered tempTables etc.
>> 
>> Thanks
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 
> 
> 
> 
> 
> 
> -- 
> Victor Shafran
> 
> VP R| Equalum
> 
> 
> Mobile: +972-523854883 <tel:%2B972-523854883> | Email: 
> victor.shaf...@equalum.io <mailto:victor.shaf...@equalum.io>


Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Thanks all for your advice.

As I understand it, in layman's terms, if I had two applications running
successfully where app 2 was dependent on app 1, I would finish app 1, store
the results in HDFS, and then app 2 would start reading the results from HDFS
and work on them.

Using Alluxio or others replaces HDFS with in-memory storage, where app 2
can pick up app 1's results from memory or even SSD and do the work.

Actually I am surprised that Spark has not incorporated this type of memory
as temporary storage.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 11:28, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> There was a mention of using Zeppelin to share RDDs with many users. From
> the notes on Zeppelin it appears that this is sharing UI and I am not sure
> how easy it is going to be changing the result set with different users
> modifying say sql queries.
>
> There is also the idea of caching RDDs with something like Apache Ignite.
> Has anyone really tried this. Will that work with multiple applications?
>
> It looks feasible as RDDs are immutable and so are registered tempTables
> etc.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Hi
Just point all users to the same app with a common Spark context.
For instance, Akka HTTP receives queries from users and launches concurrent
Spark SQL queries in different actor threads. The only prerequisite is to
launch the different jobs in different threads (as with actors).
Be careful: it's not CRUD if one of the jobs modifies a dataset; it's OK for
read only.
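
A minimal sketch of that pattern, using scala.concurrent.Future instead of actors; the fair-scheduler config, the placeholder path and the example queries are assumptions, not a definitive setup:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession

// One shared session/context serves every user request.
val spark = SparkSession.builder()
  .appName("shared-context")
  .config("spark.scheduler.mode", "FAIR")   // let concurrent jobs share resources
  .getOrCreate()

// Register the shared data once; all threads can query the same temp view.
spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

// Each incoming query runs in its own thread; an actor would work the same way.
def runQuery(user: String, sql: String): Future[Unit] = Future {
  // Optionally isolate each user in a named fair-scheduler pool.
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", user)
  spark.sql(sql).show()
}

runQuery("alice", "SELECT count(*) FROM events")
runQuery("bob", "SELECT max(id) FROM events")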

Le 27 oct. 2016 4:02 PM, "Victor Shafran" <victor.shaf...@equalum.io> a
écrit :

> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
> You help is highly appreciated
> Victor
>
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>> DataFrames among different applications and contexts. The data typically
>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>> om/blog/effective-spark-rdds-with-alluxio
>>
>> Also, Alluxio also has the concept of an "Under filesystem", which can
>> help you access your existing data across different storage systems. Here
>> is more information about the unified namespace abilities:
>> http://www.alluxio.org/docs/master/en/Unified-and
>> -Transparent-Namespace.html
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>> Alluxio is the good option to go.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> There was a mention of using Zeppelin to share RDDs with many users.
>>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>>> sure how easy it is going to be changing the result set with different
>>>> users modifying say sql queries.
>>>>
>>>> There is also the idea of caching RDDs with something like Apache
>>>> Ignite. Has anyone really tried this. Will that work with multiple
>>>> applications?
>>>>
>>>> It looks feasible as RDDs are immutable and so are registered
>>>> tempTables etc.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
>
> Victor Shafran
>
> VP R| Equalum
>
> Mobile: +972-523854883 | Email: victor.shaf...@equalum.io
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Victor Shafran
Hi Vincent,
Can you elaborate on how to implement the "shared sparkcontext and fair
scheduling" option?

My approach was to use the sparkSession.getOrCreate() method and register a
temp table in one application. However, I was not able to access this
tempTable from another application.
Your help is highly appreciated.
Victor
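
For context: a table registered with registerTempTable lives in the catalog
of that application's own SQLContext/SparkSession, inside its driver JVM, so
a second spark-submit cannot see it, and getOrCreate only reuses a session
within the same JVM. The usual workaround is to materialise the data
somewhere both applications can read. A hypothetical sketch (paths and table
name are illustrative):

// Application A (one spark-submit / driver JVM).
val dfA = sqlContext.read.json("hdfs:///data/input.json")
dfA.registerTempTable("shared_view")           // visible only inside application A
dfA.write.parquet("hdfs:///shared/results")    // visible to any other application

// Application B (a separate spark-submit): the temp table above does not
// exist here, so read the materialised copy and re-register it locally.
val dfB = sqlContext.read.parquet("hdfs:///shared/results")
dfB.registerTempTable("shared_view")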

On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com> wrote:

> Hi Mich,
>
> Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames
> among different applications and contexts. The data typically stays in
> memory, but with Alluxio's tiered storage, the "colder" data can be evicted
> out to other medium, like SSDs and HDDs. Here is a blog post discussing
> Spark RDDs and Alluxio: https://www.alluxio.com/blog/effective-spark-rdds-
> with-alluxio
>
> Also, Alluxio also has the concept of an "Under filesystem", which can
> help you access your existing data across different storage systems. Here
> is more information about the unified namespace abilities:
> http://www.alluxio.org/docs/master/en/Unified-
> and-Transparent-Namespace.html
>
> Hope that helps,
> Gene
>
> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Chanh,
>>
>> Can it share RDDs.
>>
>> Personally I have not used either Alluxio or Ignite.
>>
>>
>>1. Are there major differences between these two
>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>have any experience you can kindly share
>>
>> Regards
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>>
>>> Hi Mich,
>>> Alluxio is the good option to go.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>
>>> There was a mention of using Zeppelin to share RDDs with many users.
>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>> sure how easy it is going to be changing the result set with different
>>> users modifying say sql queries.
>>>
>>> There is also the idea of caching RDDs with something like Apache
>>> Ignite. Has anyone really tried this. Will that work with multiple
>>> applications?
>>>
>>> It looks feasible as RDDs are immutable and so are registered tempTables
>>> etc.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>
>


-- 

Victor Shafran

VP R| Equalum

Mobile: +972-523854883 | Email: victor.shaf...@equalum.io


Re: Sharing RDDS across applications and users

2016-10-27 Thread Gene Pang
Hi Mich,

Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames
among different applications and contexts. The data typically stays in
memory, but with Alluxio's tiered storage, the "colder" data can be evicted
out to other medium, like SSDs and HDDs. Here is a blog post discussing
Spark RDDs and Alluxio:
https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio

Alluxio also has the concept of an "Under filesystem", which can help
you access your existing data across different storage systems. Here is
more information about the unified namespace abilities:
http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html

Hope that helps,
Gene
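
To make the Alluxio option concrete, a hedged sketch of handing an RDD from
one Spark application to another through Alluxio (the master address, port
and paths are illustrative; the Alluxio client jar must be on the classpath,
since Alluxio is mounted as a Hadoop-compatible filesystem):

// Application 1: materialise a computed RDD into Alluxio.
val computed = sc.textFile("hdfs:///data/raw").map(_.toUpperCase)
computed.saveAsTextFile("alluxio://alluxio-master:19998/shared/computed")

// Application 2 (a separate SparkContext, possibly a different user):
// read the same data back without recomputing it.
val reused = sc.textFile("alluxio://alluxio-master:19998/shared/computed")
println(reused.count())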

On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Chanh,
>
> Can it share RDDs.
>
> Personally I have not used either Alluxio or Ignite.
>
>
>1. Are there major differences between these two
>2. Have you tried Alluxio for sharing Spark RDDs and if so do you have
>any experience you can kindly share
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Hi Mich,
>> Alluxio is the good option to go.
>>
>> Regards,
>> Chanh
>>
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>
>> There was a mention of using Zeppelin to share RDDs with many users. From
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure
>> how easy it is going to be changing the result set with different users
>> modifying say sql queries.
>>
>> There is also the idea of caching RDDs with something like Apache Ignite.
>> Has anyone really tried this. Will that work with multiple applications?
>>
>> It looks feasible as RDDs are immutable and so are registered tempTables
>> etc.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
For this you will need to contribute...

Le 27 oct. 2016 1:35 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a
écrit :

> so I assume Ignite will not work with Spark version >=2?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 12:27, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> some options:
>> - ignite for spark 1.5, can deep store on cassandra
>> - alluxio for all spark versions, can deep store on hdfs, gluster...
>>
>> ==> these are best for sharing between jobs
>>
>> - shared sparkcontext and fair scheduling, seems to be not thread safe
>> - spark jobserver and namedRDD, CRUD thread safe RDD sharing between
>> spark jobs
>> ==> these are best for sharing between users
>>
>> 2016-10-27 12:59 GMT+02:00 vincent gromakowski <
>> vincent.gromakow...@gmail.com>:
>>
>>> I would prefer sharing the spark context  and using FAIR scheduler for
>>> user concurrency
>>>
>>> Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>> a écrit :
>>>
>>>> thanks Vince.
>>>>
>>>> So Ignite uses some hash/in-memory indexing.
>>>>
>>>> The question is in practice is there much use case to use these two
>>>> fabrics for sharing RDDs.
>>>>
>>>> Remember all RDBMSs do this through shared memory.
>>>>
>>>> In layman's term if I have two independent spark-submit running, can
>>>> they share result set. For example the same tempTable etc?
>>>>
>>>> Cheers
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 27 October 2016 at 11:44, vincent gromakowski <
>>>> vincent.gromakow...@gmail.com> wrote:
>>>>
>>>>> Ignite works only with spark 1.5
>>>>> Ignite leverage indexes
>>>>> Alluxio provides tiering
>>>>> Alluxio easily integrates with underlying FS
>>>>>
>>>>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>>>> a écrit :
>>>>>
>>>>>> Thanks Chanh,
>>>>>>
>>>>>> Can it share RDDs.
>>>>>>
>>>>>> Personally I have not used either Alluxio or Ignite.
>>>>>>
>>>>>>
>>>>>>1. Are there major differences between these two
>>>>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>>>>have any experience you can kindly share
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> h

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
so I assume Ignite will not work with Spark version >=2?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 12:27, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> some options:
> - ignite for spark 1.5, can deep store on cassandra
> - alluxio for all spark versions, can deep store on hdfs, gluster...
>
> ==> these are best for sharing between jobs
>
> - shared sparkcontext and fair scheduling, seems to be not thread safe
> - spark jobserver and namedRDD, CRUD thread safe RDD sharing between spark
> jobs
> ==> these are best for sharing between users
>
> 2016-10-27 12:59 GMT+02:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
>> I would prefer sharing the spark context  and using FAIR scheduler for
>> user concurrency
>>
>> Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>> a écrit :
>>
>>> thanks Vince.
>>>
>>> So Ignite uses some hash/in-memory indexing.
>>>
>>> The question is in practice is there much use case to use these two
>>> fabrics for sharing RDDs.
>>>
>>> Remember all RDBMSs do this through shared memory.
>>>
>>> In layman's term if I have two independent spark-submit running, can
>>> they share result set. For example the same tempTable etc?
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:44, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>>> Ignite works only with spark 1.5
>>>> Ignite leverage indexes
>>>> Alluxio provides tiering
>>>> Alluxio easily integrates with underlying FS
>>>>
>>>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>>> a écrit :
>>>>
>>>>> Thanks Chanh,
>>>>>
>>>>> Can it share RDDs.
>>>>>
>>>>> Personally I have not used either Alluxio or Ignite.
>>>>>
>>>>>
>>>>>1. Are there major differences between these two
>>>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>>>have any experience you can kindly share
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 27 October 2016 at 11:29, Chanh L

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
some options:
- ignite for spark 1.5, can deep store on cassandra
- alluxio for all spark versions, can deep store on hdfs, gluster...

==> these are best for sharing between jobs

- shared sparkcontext and fair scheduling, seems to be not thread safe
- spark jobserver and namedRDD, CRUD thread safe RDD sharing between spark
jobs
==> these are best for sharing between users

2016-10-27 12:59 GMT+02:00 vincent gromakowski <
vincent.gromakow...@gmail.com>:

> I would prefer sharing the spark context  and using FAIR scheduler for
> user concurrency
>
> Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a
> écrit :
>
>> thanks Vince.
>>
>> So Ignite uses some hash/in-memory indexing.
>>
>> The question is in practice is there much use case to use these two
>> fabrics for sharing RDDs.
>>
>> Remember all RDBMSs do this through shared memory.
>>
>> In layman's term if I have two independent spark-submit running, can they
>> share result set. For example the same tempTable etc?
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:44, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> Ignite works only with spark 1.5
>>> Ignite leverage indexes
>>> Alluxio provides tiering
>>> Alluxio easily integrates with underlying FS
>>>
>>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>> a écrit :
>>>
>>>> Thanks Chanh,
>>>>
>>>> Can it share RDDs.
>>>>
>>>> Personally I have not used either Alluxio or Ignite.
>>>>
>>>>
>>>>1. Are there major differences between these two
>>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>>have any experience you can kindly share
>>>>
>>>> Regards
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>> Alluxio is the good option to go.
>>>>>
>>>>> Regards,
>>>>> Chanh
>>>>>
>>>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> There was a mention of using Zeppelin to share RDDs with many users.
>>>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>>>> sure how easy it is going to be changing the result set with different
>>>>> users modifying say sql queries.
>>>>>
>>>>> There is also the idea of caching RDDs with something like Apache
>>>>> Ignite. Has anyone really tried this. Will that work with multiple
>>>>> applications?
>>>>>
>>>>> It looks feasible as RDDs are immutable and so are registered
>>>>> tempTables etc.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
I would prefer sharing the spark context  and using FAIR scheduler for user
concurrency

Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a
écrit :

> thanks Vince.
>
> So Ignite uses some hash/in-memory indexing.
>
> The question is in practice is there much use case to use these two
> fabrics for sharing RDDs.
>
> Remember all RDBMSs do this through shared memory.
>
> In layman's term if I have two independent spark-submit running, can they
> share result set. For example the same tempTable etc?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:44, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Ignite works only with spark 1.5
>> Ignite leverage indexes
>> Alluxio provides tiering
>> Alluxio easily integrates with underlying FS
>>
>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>> a écrit :
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>> Alluxio is the good option to go.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> There was a mention of using Zeppelin to share RDDs with many users.
>>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>>> sure how easy it is going to be changing the result set with different
>>>> users modifying say sql queries.
>>>>
>>>> There is also the idea of caching RDDs with something like Apache
>>>> Ignite. Has anyone really tried this. Will that work with multiple
>>>> applications?
>>>>
>>>> It looks feasible as RDDs are immutable and so are registered
>>>> tempTables etc.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich,
I have only tried Alluxio, so I can't give you a comparison.
In my experience, I use Alluxio for big data sets (50 GB - 100 GB) that are
the input of the pipeline jobs, so you can reuse the result from a previous job.


> On Oct 27, 2016, at 5:39 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Thanks Chanh,
> 
> Can it share RDDs.
> 
> Personally I have not used either Alluxio or Ignite.
> 
> Are there major differences between these two
> Have you tried Alluxio for sharing Spark RDDs and if so do you have any 
> experience you can kindly share
> Regards
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Alluxio is the good option to go. 
> 
> Regards,
> Chanh
> 
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> 
>> There was a mention of using Zeppelin to share RDDs with many users. From 
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure 
>> how easy it is going to be changing the result set with different users 
>> modifying say sql queries.
>> 
>> There is also the idea of caching RDDs with something like Apache Ignite. 
>> Has anyone really tried this. Will that work with multiple applications?
>> 
>> It looks feasible as RDDs are immutable and so are registered tempTables etc.
>> 
>> Thanks
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 
> 



Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Vince.

So Ignite uses some hash/in-memory indexing.

The question is: in practice, is there much of a use case for using these
two fabrics to share RDDs?

Remember all RDBMSs do this through shared memory.

In layman's terms: if I have two independent spark-submit applications
running, can they share a result set, for example the same tempTable?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 11:44, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Ignite works only with spark 1.5
> Ignite leverage indexes
> Alluxio provides tiering
> Alluxio easily integrates with underlying FS
>
> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a
> écrit :
>
>> Thanks Chanh,
>>
>> Can it share RDDs.
>>
>> Personally I have not used either Alluxio or Ignite.
>>
>>
>>    1. Are there major differences between these two
>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>have any experience you can kindly share
>>
>> Regards
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>>
>>> Hi Mich,
>>> Alluxio is the good option to go.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>
>>> There was a mention of using Zeppelin to share RDDs with many users.
>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>> sure how easy it is going to be changing the result set with different
>>> users modifying say sql queries.
>>>
>>> There is also the idea of caching RDDs with something like Apache
>>> Ignite. Has anyone really tried this. Will that work with multiple
>>> applications?
>>>
>>> It looks feasible as RDDs are immutable and so are registered tempTables
>>> etc.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Ignite works only with spark 1.5
Ignite leverages indexes
Alluxio provides tiering
Alluxio easily integrates with underlying FS

Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> a
écrit :

> Thanks Chanh,
>
> Can it share RDDs.
>
> Personally I have not used either Alluxio or Ignite.
>
>
>1. Are there major differences between these two
>2. Have you tried Alluxio for sharing Spark RDDs and if so do you have
>any experience you can kindly share
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Hi Mich,
>> Alluxio is the good option to go.
>>
>> Regards,
>> Chanh
>>
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>
>> There was a mention of using Zeppelin to share RDDs with many users. From
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure
>> how easy it is going to be changing the result set with different users
>> modifying say sql queries.
>>
>> There is also the idea of caching RDDs with something like Apache Ignite.
>> Has anyone really tried this. Will that work with multiple applications?
>>
>> It looks feasible as RDDs are immutable and so are registered tempTables
>> etc.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Chanh,

Can it share RDDs?

Personally I have not used either Alluxio or Ignite.


   1. Are there major differences between these two?
   2. Have you tried Alluxio for sharing Spark RDDs, and if so, do you have
   any experience you can kindly share?

Regards


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com> wrote:

> Hi Mich,
> Alluxio is the good option to go.
>
> Regards,
> Chanh
>
> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>
> There was a mention of using Zeppelin to share RDDs with many users. From
> the notes on Zeppelin it appears that this is sharing UI and I am not sure
> how easy it is going to be changing the result set with different users
> modifying say sql queries.
>
> There is also the idea of caching RDDs with something like Apache Ignite.
> Has anyone really tried this. Will that work with multiple applications?
>
> It looks feasible as RDDs are immutable and so are registered tempTables
> etc.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich,
Alluxio is a good option to go with. 

Regards,
Chanh

> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> 
> There was a mention of using Zeppelin to share RDDs with many users. From the 
> notes on Zeppelin it appears that this is sharing UI and I am not sure how 
> easy it is going to be changing the result set with different users modifying 
> say sql queries.
> 
> There is also the idea of caching RDDs with something like Apache Ignite. Has 
> anyone really tried this. Will that work with multiple applications?
> 
> It looks feasible as RDDs are immutable and so are registered tempTables etc.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
There was a mention of using Zeppelin to share RDDs with many users. From
the notes on Zeppelin it appears that this is sharing at the UI level, and I
am not sure how easy it is going to be to change the result set with
different users modifying, say, SQL queries.

There is also the idea of caching RDDs with something like Apache Ignite.
Has anyone really tried this? Will that work with multiple applications?

It looks feasible as RDDs are immutable and so are registered tempTables
etc.

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
Hello team, so I found and resolved the issue. In case someone runs into the
same problem, this was the cause.

>> Each node was allocated 1397.3 MB of memory for storage.
16/10/11 13:16:58 INFO storage.MemoryStore: MemoryStore started with
capacity 1397.3 MB

>> However, my RDD exceeded the storage limit (the log below says it had
computed 1224.3 MB so far).

16/10/11 13:18:36 WARN storage.MemoryStore: Not enough space to cache
rdd_6_0 in memory! (computed 1224.3 MB so far)
16/10/11 13:18:36 INFO storage.MemoryStore: Memory use = 331.8 KB (blocks)
+ 1224.3 MB (scratch space shared across 2 tasks(s)) = 1224.6 MB. Storage
limit = 1397.3 MB.

Therefore, I repartitioned the RDDs for better memory utilisation, which
resolved the issue.
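
For reference, a minimal sketch of that fix (paths, partition count and
storage level are illustrative): repartition before persisting so that each
cached block is smaller than the per-executor storage capacity shown above.

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/large-input")
// More partitions mean smaller blocks per partition, so they fit within the
// ~1397 MB MemoryStore instead of the whole RDD spilling to disk.
val cached = rdd.repartition(64).persist(StorageLevel.MEMORY_ONLY)
cached.count()   // an action is still needed to materialise the cache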

Kind regards,

Guru


On 11 October 2016 at 11:23, diplomatic Guru <diplomaticg...@gmail.com>
wrote:

> @Song, I have called an action but it did not cache as you can see in the
> provided screenshot on my original email. It has cahced into Disk but not
> memory.
>
> @Chin Wei Low, I have 15GB memory allocated which is more than the dataset
> size.
>
> Any other suggestion please?
>
>
> Kind regards,
>
> Guru
>
> On 11 October 2016 at 03:34, Chin Wei Low <lowchin...@gmail.com> wrote:
>
>> Hi,
>>
>> Your RDD is 5GB, perhaps it is too large to fit into executor's storage
>> memory. You can refer to the Executors tab in Spark UI to check the
>> available memory for storage for each of the executor.
>>
>> Regards,
>> Chin Wei
>>
>> On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru <
>> diplomaticg...@gmail.com> wrote:
>>
>>> Hello team,
>>>
>>> Spark version: 1.6.0
>>>
>>> I'm trying to persist done data into memory for reusing them. However,
>>> when I call rdd.cache() OR  rdd.persist(StorageLevel.MEMORY_ONLY())  it
>>> does not store the data as I can not see any rdd information under WebUI
>>> (Storage Tab).
>>>
>>> Therefore I tried rdd.persist(StorageLevel.MEMORY_AND_DISK()), for
>>> which it stored the data into Disk only as shown in below screenshot:
>>>
>>> [image: Inline images 2]
>>>
>>> Do you know why the memory is not being used?
>>>
>>> Is there a configuration in cluster level to stop jobs from storing data
>>> into memory altogether?
>>>
>>>
>>> Please let me know.
>>>
>>> Thanks
>>>
>>> Guru
>>>
>>>
>>
>


Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
@Song, I have called an action but it did not cache, as you can see in the
screenshot provided in my original email. It has cached to disk but not to
memory.

@Chin Wei Low, I have 15GB memory allocated which is more than the dataset
size.

Any other suggestion please?


Kind regards,

Guru

On 11 October 2016 at 03:34, Chin Wei Low  wrote:

> Hi,
>
> Your RDD is 5GB, perhaps it is too large to fit into executor's storage
> memory. You can refer to the Executors tab in Spark UI to check the
> available memory for storage for each of the executor.
>
> Regards,
> Chin Wei
>
> On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru  > wrote:
>
>> Hello team,
>>
>> Spark version: 1.6.0
>>
>> I'm trying to persist done data into memory for reusing them. However,
>> when I call rdd.cache() OR  rdd.persist(StorageLevel.MEMORY_ONLY())  it
>> does not store the data as I can not see any rdd information under WebUI
>> (Storage Tab).
>>
>> Therefore I tried rdd.persist(StorageLevel.MEMORY_AND_DISK()), for which
>> it stored the data into Disk only as shown in below screenshot:
>>
>> [image: Inline images 2]
>>
>> Do you know why the memory is not being used?
>>
>> Is there a configuration in cluster level to stop jobs from storing data
>> into memory altogether?
>>
>>
>> Please let me know.
>>
>> Thanks
>>
>> Guru
>>
>>
>


Re: [Spark] RDDs are not persisting in memory

2016-10-10 Thread Chin Wei Low
Hi,

Your RDD is 5 GB; perhaps it is too large to fit into the executors' storage
memory. You can refer to the Executors tab in the Spark UI to check the
memory available for storage on each executor.

Regards,
Chin Wei
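
As a small complementary sketch, the same information can be pulled from the
driver with getExecutorMemoryStatus, which maps each executor to (maximum
memory available for caching, memory currently free):

sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, freeMem)) =>
  println(s"$executor: max storage ${maxMem / (1024 * 1024)} MB, " +
    s"free ${freeMem / (1024 * 1024)} MB")
}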

On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru 
wrote:

> Hello team,
>
> Spark version: 1.6.0
>
> I'm trying to persist done data into memory for reusing them. However,
> when I call rdd.cache() OR  rdd.persist(StorageLevel.MEMORY_ONLY())  it
> does not store the data as I can not see any rdd information under WebUI
> (Storage Tab).
>
> Therefore I tried rdd.persist(StorageLevel.MEMORY_AND_DISK()), for which
> it stored the data into Disk only as shown in below screenshot:
>
> [image: Inline images 2]
>
> Do you know why the memory is not being used?
>
> Is there a configuration in cluster level to stop jobs from storing data
> into memory altogether?
>
>
> Please let me know.
>
> Thanks
>
> Guru
>
>


[Spark] RDDs are not persisting in memory

2016-10-10 Thread diplomatic Guru
Hello team,

Spark version: 1.6.0

I'm trying to persist some computed data in memory for reuse. However, when
I call rdd.cache() OR rdd.persist(StorageLevel.MEMORY_ONLY()), it does not
store the data, as I cannot see any RDD information in the Web UI (Storage
tab).

Therefore I tried rdd.persist(StorageLevel.MEMORY_AND_DISK()), which stored
the data to disk only, as shown in the screenshot below:

[image: Inline images 2]

Do you know why the memory is not being used?

Is there a cluster-level configuration that stops jobs from storing data in
memory altogether?


Please let me know.

Thanks

Guru


Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
Thanks Daniel.
Do you have any code fragments on using cogroup or join across 2 RDDs?
I don't think that the index would help much because this is an N x M
operation, examining each cell of each RDD. Each comparison is complex, as
it needs to peer into a complex JSON.


On Mon, Aug 15, 2016 at 1:24 PM, Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> There's no real way of doing nested for-loops with RDD's because the whole
> idea is that you could have so much data in the RDD that it would be really
> ugly to store it all in one worker.
>
> There are, however, ways to handle what you're asking about.
>
> I would personally use something like CoGroup or Join between the two
> RDDs. if index matters, you can use ZipWithIndex on both before you join
> and then see which indexes match up.
>
> On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <e...@analyticsmd.com> wrote:
>
>> I've nested foreach loops like this:
>>
>>   for i in A[i] do:
>> for j in B[j] do:
>>   append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>>
>> Each element in A or B is some complex structure like:
>> (
>>   some complex JSON,
>>   some number
>> )
>>
>> Question: if A and B were represented as RRDs (e.g. RRD(A) and RRD(B)),
>> how would my code look ?
>> Are there any RRD operators that would allow me to loop thru both RRDs
>> like the above procedural code ?
>> I can't find any RRD operators nor any code fragments that would allow me
>> to do this.
>>
>> Thing is: by that time I composed RRD(A), this RRD would have contain
>> elements in array B as well as array A.
>> Same argument for RRD(B).
>>
>> Any pointers much appreciated.
>>
>> Thanks.
>>
>>
>> --
>>
>> -eric ho
>>
>>


-- 

-eric ho


Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Daniel Imberman
There's no real way of doing nested for-loops with RDDs, because the whole
idea is that you could have so much data in the RDD that it would be really
ugly to store it all in one worker.

There are, however, ways to handle what you're asking about.

I would personally use something like cogroup or join between the two RDDs.
If index matters, you can use zipWithIndex on both before you join and then
see which indexes match up.
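
A hedged sketch of that approach: derive a comparable key from each element
(the matchKey function below is purely a placeholder for whatever "matches"
means for the JSON payloads), key both RDDs by it, and then join or cogroup
so that only elements sharing a key are compared, instead of a full N x M
scan.

case class Item(json: String, number: Long)

// Hypothetical key extraction; replace with the real field used for matching.
def matchKey(item: Item): String = item.json.take(10)

val rddA = sc.parallelize(Seq(Item("""{"id":"a1"}""", 1), Item("""{"id":"a2"}""", 2)))
val rddB = sc.parallelize(Seq(Item("""{"id":"a1"}""", 7)))

val keyedA = rddA.map(a => (matchKey(a), a))
val keyedB = rddB.map(b => (matchKey(b), b))

// join: one output row per matching (a, b) pair.
val joined = keyedA.join(keyedB).values   // RDD[(Item, Item)]

// cogroup: all A-side and B-side elements per key, for custom matching logic.
val grouped = keyedA.cogroup(keyedB)      // RDD[(String, (Iterable[Item], Iterable[Item]))]

joined.collect().foreach(println)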

On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <e...@analyticsmd.com> wrote:

> I've nested foreach loops like this:
>
>   for i in A[i] do:
> for j in B[j] do:
>   append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>
> Each element in A or B is some complex structure like:
> (
>   some complex JSON,
>   some number
> )
>
> Question: if A and B were represented as RRDs (e.g. RRD(A) and RRD(B)),
> how would my code look ?
> Are there any RRD operators that would allow me to loop thru both RRDs
> like the above procedural code ?
> I can't find any RRD operators nor any code fragments that would allow me
> to do this.
>
> Thing is: by that time I composed RRD(A), this RRD would have contain
> elements in array B as well as array A.
> Same argument for RRD(B).
>
> Any pointers much appreciated.
>
> Thanks.
>
>
> --
>
> -eric ho
>
>


How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
I've nested foreach loops like this:

  for i in A[i] do:
for j in B[j] do:
  append B[j] to some list if B[j] 'matches' A[i] in some fashion.

Each element in A or B is some complex structure like:
(
  some complex JSON,
  some number
)

Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)), how
would my code look?
Are there any RDD operators that would allow me to loop through both RDDs
like the above procedural code?
I can't find any RDD operators nor any code fragments that would allow me
to do this.

Thing is: by the time I composed RDD(A), this RDD would contain
elements of array B as well as array A.
Same argument for RDD(B).

Any pointers much appreciated.

Thanks.


-- 

-eric ho


Re: how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Jörn Franke
It depends on the size of the arrays, but is what you want to achieve
similar to a join?
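
When one of the two collections is small enough to fit in memory, the join
can be done as a broadcast instead of a shuffle, so neither RDD's elements
fly around the network. A hedged sketch with made-up data, following the
a(3) / b(4) comparison from the pseudo-code quoted below:

import org.apache.spark.rdd.RDD

// Hypothetical sample data; a(3) and b(4) are the fields being compared.
val arrayB: Array[Array[String]] =
  Array(Array("b0", "b1", "b2", "b3", "x"), Array("c0", "c1", "c2", "c3", "y"))
val bSmall = sc.broadcast(arrayB)   // small side shipped once to each executor

val rddA: RDD[Array[String]] =
  sc.parallelize(Seq(Array("a0", "a1", "a2", "x"), Array("d0", "d1", "d2", "z")))

// Each element of A is compared locally against the broadcast copy of B.
val matched = rddA.flatMap { a =>
  bSmall.value.filter(b => b(4) == a(3)).map(b => (a.mkString(","), b.mkString(",")))
}
matched.collect().foreach(println)   // prints the single matching pair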

> On 15 Aug 2016, at 20:12, Eric Ho <e...@analyticsmd.com> wrote:
> 
> Hi,
> 
> I've two nested-for loops like this:
> 
> for all elements in Array A do:
> 
> for all elements in Array B do:
> 
> compare a[3] with b[4] see if they 'match' and if match, return that element;
> 
> If I were to represent Arrays A and B as 2 separate RDDs, how would my code 
> look like ? 
> 
> I couldn't find any RDD functions that would do this for me efficiently. I 
> don't really want elements of RDD(A) and RDD(B) flying all over the network 
> piecemeal...
> 
> THanks.
> 
> -- 
> 
> -eric ho
> 


how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Eric Ho
Hi,

I've two nested-for loops like this:

*for all elements in Array A do:*

*for all elements in Array B do:*

*compare a[3] with b[4] see if they 'match' and if match, return that
element;*

If I were to represent arrays A and B as 2 separate RDDs, what would my code
look like?

I couldn't find any RDD functions that would do this for me efficiently. I
don't really want elements of RDD(A) and RDD(B) flying all over the network
piecemeal...
Thanks.

-- 

-eric ho


groupBy cannot handle large RDDs

2016-06-29 Thread Kaiyin Zhong
Could anyone have a look at this? It looks like a bug:

http://stackoverflow.com/questions/38106554/groupby-cannot-handle-large-rdds

Best regards,

Kaiyin ZHONG


Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Actually I was writing code for the Connected Components algorithm. In it,
I have to keep track of a variable called the vertex number, which keeps
getting incremented based on the number of triples encountered in a line.
This variable should be available on all the nodes and all the partitions.
The way I want to keep track of it is by incorporating it in the index of
every line. By default, there are two triples in a line, but in some cases
there may be three. So based on the number of triples a line has, I want to
increment its index by that number, and the next line should take the index
of the previous line and increment it by the number of triples it has.

For example:

  asdas asdas,0

   asdasd,1

In this case the final aggregated vertex number should be 5 as there are 2
triples in the first line and 3 triples in the second.

Considering the default case, the index numbers of the first and second
line should be 2 and 4 respectively. But because there is an extra triple
in the second line in its third field, the index number of it should be 5
and not 4. There is no pattern in the occurrence of the extra triple in a
line which makes it hard to keep track of the vertex number. So the
modified zipWithIndex function that I want to write should give me the
following output:

  asdas asdas,2

   asdasd,5

I hope I explained myself clearly. I am not so sure if this is the proper
approach; maybe you could suggest a better one if there is.
On 29-Jun-2016 6:31 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:

> Since the data.length is variable, I am not sure whether mixing data.length
> and the index makes sense.
>
> Can you describe your use case in bit more detail ?
>
> Thanks
>
> On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik <naik.puni...@gmail.com>
> wrote:
>
>> Hi Ted
>>
>> So would the tuple look like: (x._1, split.startIndex + x._2 +
>> x._1.length) ?
>>
>> On Tue, Jun 28, 2016 at 11:09 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Please take a look at:
>>> core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
>>>
>>> In compute() method:
>>> val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
>>> firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
>>>   (x._1, split.startIndex + x._2)
>>>
>>> You can modify the second component of the tuple to take data.length
>>> into account.
>>>
>>> On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I wanted to change the functioning of the "zipWithIndex" function for
>>>> spark RDDs in which the output of the function is, just for an example,
>>>>  "(data, prev_index+data.length)" instead of "(data,prev_index+1)".
>>>>
>>>> How can I do this?
>>>>
>>>> --
>>>> Thank You
>>>>
>>>> Regards
>>>>
>>>> Punit Naik
>>>>
>>>
>>>
>>
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
>
>


Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
Since the data.length is variable, I am not sure whether mixing data.length
and the index makes sense.

Can you describe your use case in a bit more detail?

Thanks

On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik <naik.puni...@gmail.com> wrote:

> Hi Ted
>
> So would the tuple look like: (x._1, split.startIndex + x._2 +
> x._1.length) ?
>
> On Tue, Jun 28, 2016 at 11:09 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Please take a look at:
>> core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
>>
>> In compute() method:
>> val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
>> firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
>>   (x._1, split.startIndex + x._2)
>>
>> You can modify the second component of the tuple to take data.length
>> into account.
>>
>> On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I wanted to change the functioning of the "zipWithIndex" function for
>>> spark RDDs in which the output of the function is, just for an example,
>>>  "(data, prev_index+data.length)" instead of "(data,prev_index+1)".
>>>
>>> How can I do this?
>>>
>>> --
>>> Thank You
>>>
>>> Regards
>>>
>>> Punit Naik
>>>
>>
>>
>
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>


Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi Ted

So would the tuple look like: (x._1, split.startIndex + x._2 + x._1.length)
?

On Tue, Jun 28, 2016 at 11:09 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
>
> In compute() method:
> val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
> firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
>   (x._1, split.startIndex + x._2)
>
> You can modify the second component of the tuple to take data.length into
> account.
>
> On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com>
> wrote:
>
>> Hi
>>
>> I wanted to change the functioning of the "zipWithIndex" function for
>> spark RDDs in which the output of the function is, just for an example,
>>  "(data, prev_index+data.length)" instead of "(data,prev_index+1)".
>>
>> How can I do this?
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
>
>


-- 
Thank You

Regards

Punit Naik


Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
Please take a look at:
core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala

In compute() method:
val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
  (x._1, split.startIndex + x._2)

You can modify the second component of the tuple to take data.length into
account.
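
Without patching Spark itself, the same effect can be sketched with two
passes over the RDD: first sum the per-record weights (data.length here, but
a triple count would work the same way) for each partition, then turn those
sums into per-partition start offsets and keep a running weighted index
inside each partition. The names below are illustrative.

import org.apache.spark.rdd.RDD

// Like zipWithIndex, but each record advances the counter by its own weight
// (here the string length) instead of by 1.
def zipWithWeightedIndex(rdd: RDD[String]): RDD[(String, Long)] = {
  // Pass 1: total weight of each partition (one Long per partition).
  val partitionWeights = rdd
    .mapPartitionsWithIndex { (i, it) => Iterator((i, it.map(_.length.toLong).sum)) }
    .collect()
    .sortBy(_._1)
    .map(_._2)

  // Exclusive prefix sum: the starting offset of every partition.
  val startOffsets = partitionWeights.scanLeft(0L)(_ + _)
  val bcOffsets = rdd.sparkContext.broadcast(startOffsets)

  // Pass 2: running weighted index within each partition.
  rdd.mapPartitionsWithIndex { (i, it) =>
    var acc = bcOffsets.value(i)
    it.map { data =>
      acc += data.length   // advance by this record's weight
      (data, acc)
    }
  }
}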

On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com> wrote:

> Hi
>
> I wanted to change the functioning of the "zipWithIndex" function for
> spark RDDs in which the output of the function is, just for an example,
>  "(data, prev_index+data.length)" instead of "(data,prev_index+1)".
>
> How can I do this?
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>


Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi

I wanted to change the functioning of the "zipWithIndex" function for Spark
RDDs so that the output of the function is, just as an example,
"(data, prev_index+data.length)" instead of "(data, prev_index+1)".

How can I do this?

-- 
Thank You

Regards

Punit Naik


Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition I think you mean coalesce() where you would get one parquet file 
per partition? 

And this would be a new immutable copy so that you would want to write this new 
RDD to a different HDFS directory? 

-Mike

> On Jun 21, 2016, at 8:06 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> 
> wrote:
> 
> Apurva, 
> 
> I'd say you have to apply repartition just once to the RDD that is union of 
> all your files.
> And it has to be done right before you do anything else.
> 
> If something is not needed on your files, then the sooner you project, the 
> better.
> 
> Hope, this helps.
> 
> --
> Be well!
> Jean Morozov
> 
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <apurva3...@gmail.com 
> <mailto:apurva3...@gmail.com>> wrote:
> Hello,
> 
> I am trying to combine several small text files (each file is approx hundreds 
> of MBs to 2-3 gigs) into one big parquet file. 
> 
> I am loading each one of them and trying to take a union, however this leads 
> to enormous amounts of partitions, as union keeps on adding the partitions of 
> the input RDDs together.
> 
> I also tried loading all the files via wildcard, but that behaves almost the 
> same as union i.e. generates a lot of partitions.
> 
> One of the approaches I thought of was to repartition the RDD generated after
> each union and then continue the process, but I don't know how efficient that
> is.
>
> Has anyone come across this kind of thing before?
> 
> - Apurva 
> 
> 
> 



Re: Union of multiple RDDs

2016-06-21 Thread Eugene Morozov
Apurva,

I'd say you have to apply repartition just once, to the RDD that is the
union of all your files.
And it has to be done right before you do anything else.

If something in your files is not needed, then the sooner you project it
away, the better.

Hope, this helps.

--
Be well!
Jean Morozov

On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <apurva3...@gmail.com> wrote:

> Hello,
>
> I am trying to combine several small text files (each file is approx
> hundreds of MBs to 2-3 gigs) into one big parquet file.
>
> I am loading each one of them and trying to take a union, however this
> leads to enormous amounts of partitions, as union keeps on adding the
> partitions of the input RDDs together.
>
> I also tried loading all the files via wildcard, but that behaves almost
> the same as union i.e. generates a lot of partitions.
>
> One of the approaches I thought of was to repartition the RDD generated
> after each union and then continue the process, but I don't know how
> efficient that is.
>
> Has anyone come across this kind of thing before?
>
> - Apurva
>
>
>


Union of multiple RDDs

2016-06-21 Thread Apurva Nandan
Hello,

I am trying to combine several small text files (each file is approx
hundreds of MBs to 2-3 gigs) into one big parquet file.

I am loading each one of them and trying to take a union, however this
leads to enormous amounts of partitions, as union keeps on adding the
partitions of the input RDDs together.

I also tried loading all the files via wildcard, but that behaves almost
the same as union i.e. generates a lot of partitions.

One of the approaches I thought of was to repartition the RDD generated
after each union and then continue the process, but I don't know how
efficient that is.

Has anyone come across this kind of thing before?

- Apurva
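
A minimal sketch of the approach suggested in the replies above: load everything
in one pass with a wildcard, repartition once, and write a single consolidated
Parquet output. The paths, the column name, and the partition count are
placeholders, not values from this thread.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                     // assumes an existing SparkContext `sc`
import sqlContext.implicits._

val lines = sc.textFile("hdfs:///input/small-files/*")  // one wildcard load instead of many unions
val df = lines.map(Tuple1(_)).toDF("line")
df.repartition(64)                                      // single shuffle to a manageable partition count
  .write.parquet("hdfs:///output/consolidated.parquet")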


Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
Indeed!

I wasn't able to get this to work in cluster mode, yet, but increasing
driver and executor stack sizes in client mode (still running on a YARN EMR
cluster) got it to work! I'll fiddle more.

FWIW, I used

spark-submit --deploy-mode client --conf
"spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920" --conf
"spark.driver.extraJavaOptions=-XX:ThreadStackSize=81920" 

Thank you so much!

On Sun, Jun 5, 2016 at 2:34 PM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:

> Everett,
>
> try to increase thread stack size. To do that run your application with
> the following options (my app is a web application, so you might adjust
> something): -XX:ThreadStackSize=81920
> -Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920"
>
> The number 81920 is memory in KB. You could try smth less. It's pretty
> memory consuming to have 80M for each thread (very simply there might be
> 100 of them), but this is just a workaround. This is configuration that I
> use to train random forest with input of 400k samples.
>
> Hope this helps.
>
> --
> Be well!
> Jean Morozov
>
> On Sun, Jun 5, 2016 at 11:17 PM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> Hi!
>>
>> I have a fairly simple Spark (1.6.1) Java RDD-based program that's
>> scanning through lines of about 1000 large text files of records and
>> computing some metrics about each line (record type, line length, etc).
>> Most are identical so I'm calling distinct().
>>
>> In the loop over the list of files, I'm saving up the resulting RDDs into
>> a List. After the loop, I use the JavaSparkContext union(JavaRDD...
>> rdds) method to collapse the tables into one.
>>
>> Like this --
>>
>> List<JavaRDD> allMetrics = ...
>> for (int i = 0; i < files.size(); i++) {
>>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>>JavaRDD distinctFileMetrics =
>> lines.flatMap(fn).distinct();
>>
>>allMetrics.add(distinctFileMetrics);
>> }
>>
>> JavaRDD finalOutput =
>> jsc.union(allMetrics.toArray(...)).coalesce(10);
>> finalOutput.saveAsTextFile(...);
>>
>> There are posts suggesting
>> <https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error>
>> that using JavaRDD union(JavaRDD other) many times creates a long
>> lineage that results in a StackOverflowError.
>>
>> However, I'm seeing the StackOverflowError even with JavaSparkContext
>> union(JavaRDD... rdds).
>>
>> Should this still be happening?
>>
>> I'm using the work-around from this 2014 thread
>> <http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tt5649.html#a5752>,
>>  shown
>> below, which requires checkpointing to HDFS every N iterations, but it's
>> ugly and decreases performance.
>>
>> Is there a lighter weight way to compact the lineage? It looks like at
>> some point there might've been a "local checkpoint
>> <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-checkpointing.html>"
>> feature?
>>
>> Work-around:
>>
>> List<JavaRDD> allMetrics = ...
>> for (int i = 0; i < files.size(); i++) {
>>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>>JavaRDD distinctFileMetrics =
>> lines.flatMap(fn).distinct();
>>allMetrics.add(distinctFileMetrics);
>>
>>// Union and checkpoint occasionally to reduce lineage
>>if (i % tablesPerCheckpoint == 0) {
>>JavaRDD dataSoFar =
>>jsc.union(allMetrics.toArray(...));
>>dataSoFar.checkpoint();
>>dataSoFar.count();
>>allMetrics.clear();
>>allMetrics.add(dataSoFar);
>>}
>> }
>>
>> When the StackOverflowError happens, it's a long trace starting with --
>>
>> 16/06/05 18:01:29 INFO YarnAllocator: Driver requested a total number of 
>> 20823 executor(s).
>> 16/06/05 18:01:29 WARN ApplicationMaster: Reporter thread fails 3 time(s) in 
>> a row.
>> java.lang.StackOverflowError
>>  at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>  at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>  at 
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>  at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>
>> ...
>>
>> Thanks!
>>
>> - Everett
>>
>>
>


Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Eugene Morozov
Everett,

try to increase thread stack size. To do that run your application with the
following options (my app is a web application, so you might adjust
something): -XX:ThreadStackSize=81920
-Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920"

The number 81920 is memory in KB. You could try something less. It's pretty
memory-consuming to have 80 MB for each thread (and there can easily be
100 of them), but this is just a workaround. This is the configuration I
use to train a random forest with an input of 400k samples.

Hope this helps.

--
Be well!
Jean Morozov

On Sun, Jun 5, 2016 at 11:17 PM, Everett Anderson <ever...@nuna.com.invalid>
wrote:

> Hi!
>
> I have a fairly simple Spark (1.6.1) Java RDD-based program that's
> scanning through lines of about 1000 large text files of records and
> computing some metrics about each line (record type, line length, etc).
> Most are identical so I'm calling distinct().
>
> In the loop over the list of files, I'm saving up the resulting RDDs into
> a List. After the loop, I use the JavaSparkContext union(JavaRDD...
> rdds) method to collapse the tables into one.
>
> Like this --
>
> List<JavaRDD> allMetrics = ...
> for (int i = 0; i < files.size(); i++) {
>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>JavaRDD distinctFileMetrics =
> lines.flatMap(fn).distinct();
>
>allMetrics.add(distinctFileMetrics);
> }
>
> JavaRDD finalOutput =
> jsc.union(allMetrics.toArray(...)).coalesce(10);
> finalOutput.saveAsTextFile(...);
>
> There are posts suggesting
> <https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error>
> that using JavaRDD union(JavaRDD other) many times creates a long
> lineage that results in a StackOverflowError.
>
> However, I'm seeing the StackOverflowError even with JavaSparkContext
> union(JavaRDD... rdds).
>
> Should this still be happening?
>
> I'm using the work-around from this 2014 thread
> <http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tt5649.html#a5752>,
>  shown
> below, which requires checkpointing to HDFS every N iterations, but it's
> ugly and decreases performance.
>
> Is there a lighter weight way to compact the lineage? It looks like at
> some point there might've been a "local checkpoint
> <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-checkpointing.html>"
> feature?
>
> Work-around:
>
> List<JavaRDD> allMetrics = ...
> for (int i = 0; i < files.size(); i++) {
>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>JavaRDD distinctFileMetrics =
> lines.flatMap(fn).distinct();
>allMetrics.add(distinctFileMetrics);
>
>// Union and checkpoint occasionally to reduce lineage
>if (i % tablesPerCheckpoint == 0) {
>JavaRDD dataSoFar =
>jsc.union(allMetrics.toArray(...));
>dataSoFar.checkpoint();
>dataSoFar.count();
>allMetrics.clear();
>allMetrics.add(dataSoFar);
>}
> }
>
> When the StackOverflowError happens, it's a long trace starting with --
>
> 16/06/05 18:01:29 INFO YarnAllocator: Driver requested a total number of 
> 20823 executor(s).
> 16/06/05 18:01:29 WARN ApplicationMaster: Reporter thread fails 3 time(s) in 
> a row.
> java.lang.StackOverflowError
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>
> ...
>
> Thanks!
>
> - Everett
>
>


StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
Hi!

I have a fairly simple Spark (1.6.1) Java RDD-based program that's scanning
through lines of about 1000 large text files of records and computing some
metrics about each line (record type, line length, etc). Most are identical
so I'm calling distinct().

In the loop over the list of files, I'm saving up the resulting RDDs into a
List. After the loop, I use the JavaSparkContext union(JavaRDD... rdds)
method to collapse the tables into one.

Like this --

List<JavaRDD> allMetrics = ...
for (int i = 0; i < files.size(); i++) {
   JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
   JavaRDD distinctFileMetrics =
lines.flatMap(fn).distinct();

   allMetrics.add(distinctFileMetrics);
}

JavaRDD finalOutput =
jsc.union(allMetrics.toArray(...)).coalesce(10);
finalOutput.saveAsTextFile(...);

There are posts suggesting
<https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error>
that using JavaRDD union(JavaRDD other) many times creates a long
lineage that results in a StackOverflowError.

However, I'm seeing the StackOverflowError even with JavaSparkContext
union(JavaRDD... rdds).

Should this still be happening?

I'm using the work-around from this 2014 thread
<http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tt5649.html#a5752>,
shown
below, which requires checkpointing to HDFS every N iterations, but it's
ugly and decreases performance.

Is there a lighter weight way to compact the lineage? It looks like at some
point there might've been a "local checkpoint
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-checkpointing.html>"
feature?
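
For what it's worth, RDD.localCheckpoint() has existed since Spark 1.5; it
truncates lineage using executor storage instead of HDFS, trading fault
tolerance for speed. A hedged sketch of the same loop, in Scala for brevity,
assuming `files`, `fn`, and `tablesPerCheckpoint` as in the work-around below:

import org.apache.spark.rdd.RDD

var allMetrics: List[RDD[String]] = Nil
files.zipWithIndex.foreach { case (file, i) =>
  val distinctFileMetrics: RDD[String] = sc.textFile(file).flatMap(fn).distinct()
  allMetrics = distinctFileMetrics :: allMetrics

  // Periodically cut the lineage without an HDFS round trip.
  if (i % tablesPerCheckpoint == 0) {
    val dataSoFar = sc.union(allMetrics).localCheckpoint()
    dataSoFar.count()                   // materialize so the truncation takes effect
    allMetrics = List(dataSoFar)
  }
}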

Work-around:

List<JavaRDD> allMetrics = ...
for (int i = 0; i < files.size(); i++) {
   JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
   JavaRDD distinctFileMetrics =
lines.flatMap(fn).distinct();
   allMetrics.add(distinctFileMetrics);

   // Union and checkpoint occasionally to reduce lineage
   if (i % tablesPerCheckpoint == 0) {
   JavaRDD dataSoFar =
   jsc.union(allMetrics.toArray(...));
   dataSoFar.checkpoint();
   dataSoFar.count();
   allMetrics.clear();
   allMetrics.add(dataSoFar);
   }
}

When the StackOverflowError happens, it's a long trace starting with --

16/06/05 18:01:29 INFO YarnAllocator: Driver requested a total number
of 20823 executor(s).
16/06/05 18:01:29 WARN ApplicationMaster: Reporter thread fails 3
time(s) in a row.
java.lang.StackOverflowError
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

...

Thanks!

- Everett


Re: Using data frames to join separate RDDs in spark streaming

2016-06-05 Thread Cyril Scetbon
Problem solved by creating only one RDD.
> On Jun 1, 2016, at 14:05, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> It seems that to join a DStream with a RDD I can use :
> 
> mgs.transform(rdd => rdd.join(rdd1))
> 
> or
> 
> mgs.foreachRDD(rdd => rdd.join(rdd1))
> 
> But, I can't see why rdd1.toDF("id","aid") really causes SPARK-5063
> 
> 
>> On Jun 1, 2016, at 12:00, Cyril Scetbon <cyril.scet...@free.fr> wrote:
>> 
>> Hi guys,
>> 
>> I have a 2 input data streams that I want to join using Dataframes and 
>> unfortunately I get the message produced by 
>> https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1  
>> in (2) :
>> 
>> (1)
>> val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
>>.map(r => (r._1, r._2))
>> 
>> (2)
>> mgs.map(x => x._1)
>>  .foreachRDD { rdd =>
>>val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
>>import sqlContext.implicits._
>> 
>>val df_aids = rdd.toDF("id")
>> 
>>val df = rdd1.toDF("id", "aid")
>> 
>>df.select(explode(df("aid")).as("aid"), df("id"))
>>   .join(df_aids, $"aid" === df_aids("id"))
>>   .select(df("id"), df_aids("id"))
>>  .
>>  }
>> 
>> Is there a way to still use Dataframes to do it or I need to do everything 
>> using RDDs join only ?
>> And If I need to use only RDDs join, how to do it ? as I have a RDD (rdd1) 
>> and a DStream (mgs) ?
>> 
>> Thanks
>> -- 
>> Cyril SCETBON
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
It seems that to join a DStream with a RDD I can use :

mgs.transform(rdd => rdd.join(rdd1))

or

mgs.foreachRDD(rdd => rdd.join(rdd1))

But, I can't see why rdd1.toDF("id","aid") really causes SPARK-5063
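
For reference, a self-contained sketch of the transform-based join (made-up
data; `sc` is an existing SparkContext and `mgs` any DStream keyed the same way):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

val rdd1: RDD[(String, Seq[String])] =
  sc.parallelize(Seq("id1" -> Seq("a1", "a2"), "id2" -> Seq("a3")))

// rdd1 is only referenced on the driver here, so SPARK-5063 does not apply.
def joinWithStatic(mgs: DStream[(String, String)]): DStream[(String, (String, Seq[String]))] =
  mgs.transform(rdd => rdd.join(rdd1))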

 
> On Jun 1, 2016, at 12:00, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> Hi guys,
> 
> I have a 2 input data streams that I want to join using Dataframes and 
> unfortunately I get the message produced by 
> https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1  
> in (2) :
> 
> (1)
> val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
> .map(r => (r._1, r._2))
> 
> (2)
> mgs.map(x => x._1)
>   .foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
> import sqlContext.implicits._
> 
> val df_aids = rdd.toDF("id")
> 
> val df = rdd1.toDF("id", "aid")
> 
> df.select(explode(df("aid")).as("aid"), df("id"))
>.join(df_aids, $"aid" === df_aids("id"))
>.select(df("id"), df_aids("id"))
>   .
>   }
> 
> Is there a way to still use Dataframes to do it or I need to do everything 
> using RDDs join only ?
> And If I need to use only RDDs join, how to do it ? as I have a RDD (rdd1) 
> and a DStream (mgs) ?
> 
> Thanks
> -- 
> Cyril SCETBON
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
Hi guys,

I have a 2 input data streams that I want to join using Dataframes and 
unfortunately I get the message produced by 
https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1  in 
(2) :

(1)
val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
 .map(r => (r._1, r._2))

(2)
mgs.map(x => x._1)
   .foreachRDD { rdd =>
 val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
 import sqlContext.implicits._

 val df_aids = rdd.toDF("id")

 val df = rdd1.toDF("id", "aid")

 df.select(explode(df("aid")).as("aid"), df("id"))
.join(df_aids, $"aid" === df_aids("id"))
.select(df("id"), df_aids("id"))
   .
   }

Is there a way to still use Dataframes to do it or I need to do everything 
using RDDs join only ?
And If I need to use only RDDs join, how to do it ? as I have a RDD (rdd1) and 
a DStream (mgs) ?

Thanks
-- 
Cyril SCETBON


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both RDDs together instead of
doing a cartesian. Say, generate a pair RDD from each with the first letter
as the key, then do a partition and a join.
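
A rough sketch of that blocking idea using the example strings from this
thread; similarity() here is only a placeholder bigram score, not the real
fuzzy-match metric:

def similarity(a: String, b: String): Double = {
  val bigrams = (s: String) => s.sliding(2).toSet          // placeholder scoring only
  val (x, y) = (bigrams(a), bigrams(b))
  if (x.isEmpty && y.isEmpty) 1.0
  else x.intersect(y).size.toDouble / x.union(y).size
}

val rddA = sc.parallelize(Seq("hi", "bye", "ch"))
val rddB = sc.parallelize(Seq("padma", "hihi", "chch", "priya"))

// Key both sides by a cheap blocking key so candidate pairs are only generated
// within a block, instead of a full cartesian product.
val byKeyA = rddA.map(s => (s.take(1).toLowerCase, s))
val byKeyB = rddB.map(s => (s.take(1).toLowerCase, s))

val matches = byKeyA.join(byKeyB)
  .map { case (_, (a, b)) => (a, b, similarity(a, b)) }
  .filter { case (_, _, score) => score > 0.85 }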
On May 25, 2016 8:04 PM, "Priya Ch" <learnings.chitt...@gmail.com> wrote:

> Hi,
>   RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would
> like to filter out the strings that have greater than 85% match and
> generate a score for it which is used in the susbsequent calculations.
>
> I tried generating pair rdd from both rdds A and B with same key for all
> the records. Now performing A.join(B) is also resulting in huge execution
> time..
>
> How do I go about with map and reduce here ? To generate pairs from 2 rdds
> I dont think map can be used because we cannot have rdd inside another rdd.
>
> Would be glad if you can throw me some light on this.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Solr or Elastic search provide much more functionality and are faster in
>> this context. The decision for or against them depends on your current and
>> future use cases. Your current use case is still very abstract so in order
>> to get a more proper recommendation you need to provide more details
>> including size of dataset, what you do with the result of the matching do
>> you just need the match number or also the pairs in the results etc.
>>
>> Your concrete problem can also be solved in Spark (though it is not the
>> best and most efficient tool for this, but it has other strength) using the
>> map reduce steps. There are different ways to implement this (Generate
>> pairs from the input datasets in the map step or (maybe less recommendable)
>> broadcast the smaller dataset to all nodes and do the matching with the
>> bigger dataset there.
>> This highly depends on the data in your data set. How they compare in
>> size etc.
>>
>>
>>
>> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>
>> Why do i need to deploy solr for text anaytics...i have files placed in
>> HDFS. just need to look for matches against each string in both files and
>> generate those records whose match is > 85%. We trying to Fuzzy match
>> logic.
>>
>> How can use map/reduce operations across 2 rdds ?
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>>
>>> Alternatively depending on the exact use case you may employ solr on
>>> Hadoop for text analytics
>>>
>>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com>
>>> wrote:
>>> >
>>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD
>>> B of
>>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i
>>> need
>>> > to check the matches found in rdd B as such for string "hi" i have to
>>> check
>>> > the matches against all strings in RDD B which means I need generate
>>> every
>>> > possible combination r
>>>
>>
>>
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi,
  RDD A is about 30 MB and RDD B is about 8 MB. Upon matching, we would
like to filter out the strings that have a greater than 85% match and
generate a score for each, which is used in the subsequent calculations.

I tried generating a pair RDD from both A and B with the same key for all
the records. Now performing A.join(B) also results in a huge execution
time.

How do I go about this with map and reduce? To generate pairs from the two
RDDs, I don't think map can be used, because we cannot have an RDD inside
another RDD.

Would be glad if you could throw some light on this for me.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Solr or Elastic search provide much more functionality and are faster in
> this context. The decision for or against them depends on your current and
> future use cases. Your current use case is still very abstract so in order
> to get a more proper recommendation you need to provide more details
> including size of dataset, what you do with the result of the matching do
> you just need the match number or also the pairs in the results etc.
>
> Your concrete problem can also be solved in Spark (though it is not the
> best and most efficient tool for this, but it has other strength) using the
> map reduce steps. There are different ways to implement this (Generate
> pairs from the input datasets in the map step or (maybe less recommendable)
> broadcast the smaller dataset to all nodes and do the matching with the
> bigger dataset there.
> This highly depends on the data in your data set. How they compare in size
> etc.
>
>
>
> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> Why do i need to deploy solr for text anaytics...i have files placed in
> HDFS. just need to look for matches against each string in both files and
> generate those records whose match is > 85%. We trying to Fuzzy match
> logic.
>
> How can use map/reduce operations across 2 rdds ?
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>>
>> Alternatively depending on the exact use case you may employ solr on
>> Hadoop for text analytics
>>
>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>> >
>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD
>> B of
>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i
>> need
>> > to check the matches found in rdd B as such for string "hi" i have to
>> check
>> > the matches against all strings in RDD B which means I need generate
>> every
>> > possible combination r
>>
>
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Solr or Elasticsearch provide much more functionality and are faster in this
context. The decision for or against them depends on your current and future
use cases. Your current use case is still very abstract, so in order to get a
more proper recommendation you need to provide more details, including the size
of the dataset and what you do with the result of the matching (do you just
need the match count, or also the matching pairs in the results, etc.).

Your concrete problem can also be solved in Spark (though it is not the best or
most efficient tool for this, it has other strengths) using map and reduce
steps. There are different ways to implement this: generate pairs from the
input datasets in the map step, or (maybe less recommendable) broadcast the
smaller dataset to all nodes and do the matching against the bigger dataset
there. This highly depends on the data in your datasets and how they compare in
size, etc.



> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote:
> 
> Why do i need to deploy solr for text anaytics...i have files placed in HDFS. 
> just need to look for matches against each string in both files and generate 
> those records whose match is > 85%. We trying to Fuzzy match logic. 
> 
> How can use map/reduce operations across 2 rdds ?
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> Alternatively depending on the exact use case you may employ solr on Hadoop 
>> for text analytics
>> 
>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com> wrote:
>> >
>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B of
>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
>> > to check the matches found in rdd B as such for string "hi" i have to check
>> > the matches against all strings in RDD B which means I need generate every
>> > possible combination r
> 


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Why do I need to deploy Solr for text analytics? I have files placed in
HDFS and just need to look for matches against each string in both files
and generate those records whose match is > 85%. We are trying fuzzy-match
logic.

How can I use map/reduce operations across 2 RDDs?

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote:

>
> Alternatively depending on the exact use case you may employ solr on
> Hadoop for text analytics
>
> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com> wrote:
> >
> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B
> of
> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
> > to check the matches found in rdd B as such for string "hi" i have to
> check
> > the matches against all strings in RDD B which means I need generate
> every
> > possible combination r
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke

Alternatively depending on the exact use case you may employ solr on Hadoop for 
text analytics 

> On 25 May 2016, at 12:57, Priya Ch  wrote:
> 
> Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B of
> strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
> to check the matches found in rdd B as such for string "hi" i have to check
> the matches against all strings in RDD B which means I need generate every
> possible combination r

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke

No this is not needed, look at the map / reduce operations and the standard 
spark word count

> On 25 May 2016, at 12:57, Priya Ch  wrote:
> 
> Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B of
> strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
> to check the matches found in rdd B as such for string "hi" i have to check
> the matches against all strings in RDD B which means I need generate every
> possible combination r

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Let's say I have RDD A of strings as {"hi","bye","ch"} and another RDD B of
strings as {"padma","hihi","chch","priya"}. For every string in RDD A, I need
to check the matches found in RDD B; for example, for the string "hi" I have
to check the matches against all strings in RDD B, which means I need to
generate every possible combination. Hence I generate the cartesian product
and then use a map transformation on the cartesian RDD to check the matches
found.

Is there any better way I could do this other than performing a cartesian?
So far the application has taken 30 minutes, and on top of that I see
executor-lost issues.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:22 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> What is the use case of this ? A Cartesian product is by definition slow
> in any system. Why do you need this? How long does your application take
> now?
>
> On 25 May 2016, at 12:42, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> I tried
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
> this is taking too much time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
>> parquet, orc, ...?
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>>
>>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs
>>> .I am converting the joined dataframe to rdd (dataframe.rdd) and using
>>> saveAsTextFile, trying to save it. However, this is also taking too much
>>> time.
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> Seems you'd be better off using DataFrame#join instead of  RDD
>>>> .cartesian
>>>> because it always needs shuffle operations which have alot of
>>>> overheads such as reflection, serialization, ...
>>>> In your case,  since the smaller table is 7mb, DataFrame#join uses a
>>>> broadcast strategy.
>>>> This is a little more efficient than  RDD.cartesian.
>>>>
>>>> // maropu
>>>>
>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> It is basically a Cartesian join like RDBMS
>>>>>
>>>>> Example:
>>>>>
>>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>>
>>>>> The results of this query matches every row in the FinancialCodes
>>>>> table with every row in the FinancialData table.  Each row consists
>>>>> of all columns from the FinancialCodes table followed by all columns from
>>>>> the FinancialData table.
>>>>>
>>>>>
>>>>> Not very useful
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>>   I have two RDDs A and B where in A is of size 30 MB and B is of
>>>>>> size 7 MB, A.cartesian(B) is taking too much time. Is there any 
>>>>>> bottleneck
>>>>>> in cartesian operation ?
>>>>>>
>>>>>> I am using spark 1.6.0 version
>>>>>>
>>>>>> Regards,
>>>>>> Padma Ch
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
What is the use case of this ? A Cartesian product is by definition slow in any 
system. Why do you need this? How long does your application take now?

> On 25 May 2016, at 12:42, Priya Ch <learnings.chitt...@gmail.com> wrote:
> 
> I tried 
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even 
> this is taking too much time.
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin@gmail.com> 
>> wrote:
>> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as 
>> parquet, orc, ...?
>> 
>> // maropu
>> 
>>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com> 
>>> wrote:
>>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs .I 
>>> am converting the joined dataframe to rdd (dataframe.rdd) and using 
>>> saveAsTextFile, trying to save it. However, this is also taking too much 
>>> time.
>>> 
>>> Thanks,
>>> Padma Ch
>>> 
>>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com> 
>>>> wrote:
>>>> Hi, 
>>>> 
>>>> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
>>>> because it always needs shuffle operations which have alot of overheads 
>>>> such as reflection, serialization, ...
>>>> In your case,  since the smaller table is 7mb, DataFrame#join uses a 
>>>> broadcast strategy.
>>>> This is a little more efficient than  RDD.cartesian.
>>>> 
>>>> // maropu
>>>> 
>>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh 
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>> It is basically a Cartesian join like RDBMS 
>>>>> 
>>>>> Example:
>>>>> 
>>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>> 
>>>>> The results of this query matches every row in the FinancialCodes table 
>>>>> with every row in the FinancialData table.  Each row consists of all 
>>>>> columns from the FinancialCodes table followed by all columns from the 
>>>>> FinancialData table.
>>>>> 
>>>>> Not very useful 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7 
>>>>>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in 
>>>>>> cartesian operation ?
>>>>>> 
>>>>>> I am using spark 1.6.0 version
>>>>>> 
>>>>>> Regards,
>>>>>> Padma Ch
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> ---
>>>> Takeshi Yamamuro
>> 
>> 
>> 
>> -- 
>> ---
>> Takeshi Yamamuro
> 


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
I tried
dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
this is taking too much time.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs
>> .I am converting the joined dataframe to rdd (dataframe.rdd) and using
>> saveAsTextFile, trying to save it. However, this is also taking too much
>> time.
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
>>> because it always needs shuffle operations which have alot of overheads
>>> such as reflection, serialization, ...
>>> In your case,  since the smaller table is 7mb, DataFrame#join uses a
>>> broadcast strategy.
>>> This is a little more efficient than  RDD.cartesian.
>>>
>>> // maropu
>>>
>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> It is basically a Cartesian join like RDBMS
>>>>
>>>> Example:
>>>>
>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>
>>>> The results of this query matches every row in the FinancialCodes table
>>>> with every row in the FinancialData table.  Each row consists of all
>>>> columns from the FinancialCodes table followed by all columns from the
>>>> FinancialData table.
>>>>
>>>>
>>>> Not very useful
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>>   I have two RDDs A and B where in A is of size 30 MB and B is of size
>>>>> 7 MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>>>>> cartesian operation ?
>>>>>
>>>>> I am using spark 1.6.0 version
>>>>>
>>>>> Regards,
>>>>> Padma Ch
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
parquet, orc, ...?
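
For example, a minimal sketch of the DataFrameWriter route (Spark 1.4+); the
function name and output path are illustrative:

import org.apache.spark.sql.DataFrame

// Writes the joined result in a columnar format directly, without the detour
// through dataframe.rdd + saveAsTextFile.
def saveJoined(joined: DataFrame, path: String): Unit =
  joined.write.format("parquet").save(path)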

// maropu

On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:

> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs .I
> am converting the joined dataframe to rdd (dataframe.rdd) and using
> saveAsTextFile, trying to save it. However, this is also taking too much
> time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
>> because it always needs shuffle operations which have alot of overheads
>> such as reflection, serialization, ...
>> In your case,  since the smaller table is 7mb, DataFrame#join uses a
>> broadcast strategy.
>> This is a little more efficient than  RDD.cartesian.
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It is basically a Cartesian join like RDBMS
>>>
>>> Example:
>>>
>>> SELECT * FROM FinancialCodes,  FinancialData
>>>
>>> The results of this query matches every row in the FinancialCodes table
>>> with every row in the FinancialData table.  Each row consists of all
>>> columns from the FinancialCodes table followed by all columns from the
>>> FinancialData table.
>>>
>>>
>>> Not very useful
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>>   I have two RDDs A and B where in A is of size 30 MB and B is of size
>>>> 7 MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>>>> cartesian operation ?
>>>>
>>>> I am using spark 1.6.0 version
>>>>
>>>> Regards,
>>>> Padma Ch
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi , Yes I have joined using DataFrame join. Now to save this into hdfs .I
am converting the joined dataframe to rdd (dataframe.rdd) and using
saveAsTextFile, trying to save it. However, this is also taking too much
time.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
> because it always needs shuffle operations which have alot of overheads
> such as reflection, serialization, ...
> In your case,  since the smaller table is 7mb, DataFrame#join uses a
> broadcast strategy.
> This is a little more efficient than  RDD.cartesian.
>
> // maropu
>
> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> It is basically a Cartesian join like RDBMS
>>
>> Example:
>>
>> SELECT * FROM FinancialCodes,  FinancialData
>>
>> The results of this query matches every row in the FinancialCodes table
>> with every row in the FinancialData table.  Each row consists of all
>> columns from the FinancialCodes table followed by all columns from the
>> FinancialData table.
>>
>>
>> Not very useful
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7
>>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>>> cartesian operation ?
>>>
>>> I am using spark 1.6.0 version
>>>
>>> Regards,
>>> Padma Ch
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Hi,

Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
because it always needs shuffle operations, which have a lot of overheads
such as reflection, serialization, ...
In your case, since the smaller table is 7 MB, DataFrame#join uses a
broadcast strategy.
This is a little more efficient than RDD.cartesian.

// maropu

On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> It is basically a Cartesian join like RDBMS
>
> Example:
>
> SELECT * FROM FinancialCodes,  FinancialData
>
> The results of this query matches every row in the FinancialCodes table
> with every row in the FinancialData table.  Each row consists of all
> columns from the FinancialCodes table followed by all columns from the
> FinancialData table.
>
>
> Not very useful
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
>> Hi All,
>>
>>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7
>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>> cartesian operation ?
>>
>> I am using spark 1.6.0 version
>>
>> Regards,
>> Padma Ch
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Mich Talebzadeh
It is basically a Cartesian join like RDBMS

Example:

SELECT * FROM FinancialCodes,  FinancialData

The results of this query matches every row in the FinancialCodes table
with every row in the FinancialData table.  Each row consists of all
columns from the FinancialCodes table followed by all columns from the
FinancialData table.


Not very useful


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:

> Hi All,
>
>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7
> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
> cartesian operation ?
>
> I am using spark 1.6.0 version
>
> Regards,
> Padma Ch
>


Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi All,

  I have two RDDs A and B where in A is of size 30 MB and B is of size 7
MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
cartesian operation ?

I am using spark 1.6.0 version

Regards,
Padma Ch


How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in,
say, 2 different case classes:
Case1 and Case2

How do I map the values read from the text file so that my function in
Scala can return 2 different RDDs, with each RDD typed to one of these 2
case classes?
E.g. the first 11 fields mapped to Case1 while the remaining 6 fields mapped to Case2.
Any pointer here or code snippet would be really helpful.
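
One possible shape, as a hedged sketch (the field names, the field grouping, and
the comma delimiter are assumptions, not from the question): parse each line
once, build both case-class instances, cache the parsed RDD, and project the two
RDDs from it.

import org.apache.spark.rdd.RDD

case class Case1(f1: String, f2: String, f3: String)      // placeholder first group of fields
case class Case2(f12: String, f13: String, f14: String)   // placeholder second group of fields

def splitFields(lines: RDD[String]): (RDD[Case1], RDD[Case2]) = {
  val parsed = lines.map { line =>
    val f = line.split(",", -1)                            // assumes comma-delimited records
    (Case1(f(0), f(1), f(2)), Case2(f(11), f(12), f(13)))
  }.cache()                                                // parse once, project twice
  (parsed.map(_._1), parsed.map(_._2))
}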


-- 
Thanks
Deepak


How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in,
say, 2 different case classes:
Case1 and Case2

How do I map the values read from the text file so that my function in
Scala can return 2 different RDDs, with each RDD typed to one of these 2
case classes?

-- 
Thanks
Deepak


Re: Confused - returning RDDs from functions

2016-05-13 Thread Dood

  
  
On 5/12/2016 10:01 PM, Holden Karau wrote:
> This is not the expected behavior, can you maybe post the code where you
> are running into this?

Hello, thanks for replying!

Below is the function I took out from the code.

def converter(rdd: RDD[(String, JsValue)], param:String): RDD[(String, Int)] = {
  // I am breaking this down for future readability and ease of optimization
  // as a first attempt at solving this problem, I am not concerned with performance
  // and pretty, more with accuracy ;)
  // r1 will be an RDD containing only the "param" method of selection
  val r1:RDD[(String,JsValue)] = rdd.filter(x => (x._2 \ "field1" \ "field2").as[String].replace("\"","") == param.replace("\"",""))
  // r2 will be an RDD of Lists of fields (A1-Z1) with associated counts
  // remapFields returns a List[(String,Int)]
  val r2:RDD[List[(String,Int)]] = r1.map(x => remapFields(x._2 \ "extra"))
  // r3 will be flattened to enable grouping
  val r3:RDD[(String,Int)] = r2.flatMap(x => x)
  // now we can group by entity
  val r4:RDD[(String,Iterable[(String,Int)])] = r3.groupBy(x => x._1)
  // and produce a mapping of entity -> count pairs
  val r5:RDD[(String,Int)] = r4.map(x => (x._1, x._2.map(y => y._2).sum))
  // return the result
  r5
}

If I call on the above function and collectAsMap on the returned
RDD, I get an empty Map(). If I copy/paste this code into the
caller, I get the properly filled in Map.

I am fairly new to Spark and Scala so excuse any inefficiencies - my
priority was to be able to solve the problem in an obvious and
correct way and worry about making it pretty later. 

Thanks!

On Thursday, May 12, 2016, Dood@ODDO wrote:
> Hello all,
>
> I have been programming for years but this has me baffled.
>
> I have an RDD[(String,Int)] that I return from a function after
> extensive manipulation of an initial RDD of a different type.
> When I return this RDD and initiate the .collectAsMap() on it
> from the caller, I get an empty Map().
>
> If I copy and paste the code from the function into the caller
> (same exact code) and produce the same RDD and call
> collectAsMap() on it, I get the Map with all the expected
> information in it.
>
> What gives?
>
> Does Spark defy programming principles or am I crazy? ;-)
>
> Thanks!
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Confused - returning RDDs from functions

2016-05-12 Thread Holden Karau
This is not the expected behavior, can you maybe post the code where you
are running into this?

On Thursday, May 12, 2016, Dood@ODDO  wrote:

> Hello all,
>
> I have been programming for years but this has me baffled.
>
> I have an RDD[(String,Int)] that I return from a function after extensive
> manipulation of an initial RDD of a different type. When I return this RDD
> and initiate the .collectAsMap() on it from the caller, I get an empty
> Map().
>
> If I copy and paste the code from the function into the caller (same exact
> code) and produce the same RDD and call collectAsMap() on it, I get the Map
> with all the expected information in it.
>
> What gives?
>
> Does Spark defy programming principles or am I crazy? ;-)
>
> Thanks!
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Confused - returning RDDs from functions

2016-05-12 Thread Dood

Hello all,

I have been programming for years but this has me baffled.

I have an RDD[(String,Int)] that I return from a function after 
extensive manipulation of an initial RDD of a different type. When I 
return this RDD and initiate the .collectAsMap() on it from the caller, 
I get an empty Map().


If I copy and paste the code from the function into the caller (same 
exact code) and produce the same RDD and call collectAsMap() on it, I 
get the Map with all the expected information in it.


What gives?

Does Spark defy programming principles or am I crazy? ;-)

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
Hi,

Yes, I believe people do that. I also believe that Spark ML is able to
figure out when to cache some internal RDDs itself; that's definitely true
for the random forest algorithm. It doesn't hurt to cache the same RDD
twice, either.

But it's not clear what you'd like to know...
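
For concreteness, a minimal sketch of the usual pattern with the RDD-based
MLlib API: cache the training split before an iterative algorithm so each pass
reads it from memory. The path and parameters are placeholders.

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
train.cache()   // reused across many passes while building the trees

val model = RandomForest.trainClassifier(
  train, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 50, featureSubsetStrategy = "auto", impurity = "gini",
  maxDepth = 5, maxBins = 32)

val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter { case (pred, label) => pred == label }
  .count().toDouble / test.count()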

--
Be well!
Jean Morozov

On Sun, Apr 3, 2016 at 11:34 AM, Sergey <ser...@gmail.com> wrote:

> Hi Spark ML experts!
>
> Do you use RDDs caching somewhere together with ML lib to speed up
> calculation?
> I mean typical machine learning use cases.
> Train-test split, train, evaluate, apply model.
>
> Sergey.
>


RDDs caching in typical machine learning use cases

2016-04-03 Thread Sergey
Hi Spark ML experts!

Do you use RDD caching somewhere together with MLlib to speed up
calculations?
I mean typical machine learning use cases:
train-test split, train, evaluate, apply the model.

Sergey.


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Thamme Gowda N.
Hi Jeff,

Yes, you are absolutely right.
It is because of the RecordReader reusing the Writable Instance. I did not
anticipate this as it worked for text files.

Thank you so much for doing this.
 Your answer is accepted!


Best,
Thamme



--
*Thamme Gowda N. *
Grad Student at usc.edu
Twitter: @thammegowda
Website : http://scf.usc.edu/~tnarayan/

On Tue, Mar 22, 2016 at 9:00 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Zhan's reply on stackoverflow is correct.
>
>
> Please refer to the comments in sequenceFile.
>
> /**
>  * Get an RDD for a Hadoop SequenceFile with given key and value types.
>  *
>  * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
>  * object for each record, directly caching the returned RDD or directly
>  * passing it to an aggregation or shuffle operation will create many
>  * references to the same object. If you plan to directly cache, sort, or
>  * aggregate Hadoop writable objects, you should first copy them using a
>  * map function.
>  */
>
>
>
> On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> I think I got the root cause, you can use Text.toString() to solve this
>> issue.  Because the Text is shared so the last record display multiple
>> times.
>>
>> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> Looks like a spark bug. I can reproduce it for sequence file, but it
>>> works for text file.
>>>
>>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
>>> wrote:
>>>
>>>> Hi spark experts,
>>>>
>>>> I am facing issues with cached RDDs. I noticed that few entries
>>>> get duplicated for n times when the RDD is cached.
>>>>
>>>> I asked a question on Stackoverflow with my code snippet to reproduce
>>>> it.
>>>>
>>>> I really appreciate  if you can visit
>>>> http://stackoverflow.com/q/36168827/1506477
>>>> and answer my question / give your comments.
>>>>
>>>> Or at the least confirm that it is a bug.
>>>>
>>>> Thanks in advance for your help!
>>>>
>>>> --
>>>> Thamme
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Zhan's reply on stackoverflow is correct.



Please refer to the comments in sequenceFile.

/**
 * Get an RDD for a Hadoop SequenceFile with given key and value types.
 *
 * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
 * object for each record, directly caching the returned RDD or directly
 * passing it to an aggregation or shuffle operation will create many
 * references to the same object. If you plan to directly cache, sort, or
 * aggregate Hadoop writable objects, you should first copy them using
 * a map function.
 */
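
A minimal sketch of the resulting fix (the path is a placeholder): copy each
reused Writable into an immutable value before caching.

import org.apache.hadoop.io.{LongWritable, Text}

val raw = sc.sequenceFile("hdfs:///data/input.seq", classOf[LongWritable], classOf[Text])

// Without this copy, many cached records can end up referencing the same reused
// Text instance, which is the duplication described in the question.
val safe = raw.map { case (k, v) => (k.get(), v.toString) }.cache()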



On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> I think I got the root cause, you can use Text.toString() to solve this
> issue.  Because the Text is shared so the last record display multiple
> times.
>
> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Looks like a spark bug. I can reproduce it for sequence file, but it
>> works for text file.
>>
>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
>> wrote:
>>
>>> Hi spark experts,
>>>
>>> I am facing issues with cached RDDs. I noticed that few entries
>>> get duplicated for n times when the RDD is cached.
>>>
>>> I asked a question on Stackoverflow with my code snippet to reproduce it.
>>>
>>> I really appreciate  if you can visit
>>> http://stackoverflow.com/q/36168827/1506477
>>> and answer my question / give your comments.
>>>
>>> Or at the least confirm that it is a bug.
>>>
>>> Thanks in advance for your help!
>>>
>>> --
>>> Thamme
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
I think I got the root cause: you can use Text.toString() to solve this
issue. Because the Text object is shared, the last record is displayed
multiple times.

On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> Looks like a spark bug. I can reproduce it for sequence file, but it works
> for text file.
>
> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com>
> wrote:
>
>> Hi spark experts,
>>
>> I am facing issues with cached RDDs. I noticed that few entries
>> get duplicated for n times when the RDD is cached.
>>
>> I asked a question on Stackoverflow with my code snippet to reproduce it.
>>
>> I really appreciate  if you can visit
>> http://stackoverflow.com/q/36168827/1506477
>> and answer my question / give your comments.
>>
>> Or at the least confirm that it is a bug.
>>
>> Thanks in advance for your help!
>>
>> --
>> Thamme
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Looks like a spark bug. I can reproduce it for sequence file, but it works
for text file.

On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <tgow...@gmail.com> wrote:

> Hi spark experts,
>
> I am facing issues with cached RDDs. I noticed that few entries
> get duplicated for n times when the RDD is cached.
>
> I asked a question on Stackoverflow with my code snippet to reproduce it.
>
> I really appreciate  if you can visit
> http://stackoverflow.com/q/36168827/1506477
> and answer my question / give your comments.
>
> Or at the least confirm that it is a bug.
>
> Thanks in advance for your help!
>
> --
> Thamme
>



-- 
Best Regards

Jeff Zhang


[Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Thamme Gowda N.
Hi spark experts,

I am facing issues with cached RDDs. I noticed that a few entries
get duplicated n times when the RDD is cached.

I asked a question on Stackoverflow with my code snippet to reproduce it.

I would really appreciate it if you could visit
http://stackoverflow.com/q/36168827/1506477
and answer my question / give your comments.

Or at the least confirm that it is a bug.

Thanks in advance for your help!

--
Thamme


Re: Can't zip RDDs with unequal numbers of partitions

2016-03-20 Thread Jakob Odersky
Can you share a snippet that reproduces the error? What was
spark.sql.autoBroadcastJoinThreshold before your last change?

On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový <syrovy.j...@gmail.com> wrote:
> Hi,
>
> Any idea what could be causing this issue? It started appearing after
> changing the parameter
>
> spark.sql.autoBroadcastJoinThreshold to 10
>
>
> Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal
> numbers of partitions
> at
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.PartitionCoalescer.<init>(CoalescedRDD.scala:172)
> at
> org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:85)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
> at
> org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
> ... 28 more
>




Re: Can't zip RDDs with unequal numbers of partitions

2016-03-19 Thread Jiří Syrový
Unfortunately I can't share a snippet quickly, as the code is generated,
but for now I can at least share the plan. (See it here -
http://pastebin.dqd.cz/RAhm/)

After I increased spark.sql.autoBroadcastJoinThreshold to 30 from
10, it went through without any problems. With 10 it was always
failing during the "planning" phase with the exception above.
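
For reference, a sketch of how the threshold is typically changed (the
100 MB value below is only an illustration; a Spark 1.x SQLContext is
assumed):

  // At submit time:
  //   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=104857600 ...

  // Or at runtime on the SQLContext (value in bytes; -1 disables
  // broadcast joins entirely):
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "104857600")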

2016-03-17 22:05 GMT+01:00 Jakob Odersky <ja...@odersky.com>:

> Can you share a snippet that reproduces the error? What was
> spark.sql.autoBroadcastJoinThreshold before your last change?
>
> On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový <syrovy.j...@gmail.com>
> wrote:
> > Hi,
> >
> > Any idea what could be causing this issue? It started appearing after
> > changing the parameter
> >
> > spark.sql.autoBroadcastJoinThreshold to 100000
> >
> >
> > Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with
> unequal
> > numbers of partitions
> > at
> >
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> > org.apache.spark.rdd.PartitionCoalescer.<init>(CoalescedRDD.scala:172)
> > at
> > org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:85)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> >
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> >
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> >
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at
> > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
> > at
> >
> org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
> > at
> >
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
> > at
> >
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
> > at
> >
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
> > ... 28 more
> >
>
>
>


Can't zip RDDs with unequal numbers of partitions

2016-03-18 Thread Jiří Syrový
Hi,

Any idea what could be causing this issue? It started appearing after
changing the parameter

spark.sql.autoBroadcastJoinThreshold to 10

Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal
numbers of partitions
at
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.PartitionCoalescer.<init>(CoalescedRDD.scala:172)
at
org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:85)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
at
org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
at
org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
at
org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 28 more


Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :)
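
A minimal sketch of that approach (directory names are illustrative):

  // Write each RDD to its own subdirectory instead of unioning first.
  rdd1.saveAsTextFile("a/b")
  rdd2.saveAsTextFile("a/c")

  // Read everything back in one go with a glob over the parent directory.
  val combined = sc.textFile("a/*")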

On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Spark users and developers,
>
> Does anyone know how to union two RDDs without the overhead of the union?
>
> Say rdd1.union(rdd2).saveAsTextFile(..)
> This requires a stage to union the two RDDs before saveAsTextFile (two
> stages). Is there a way to skip the union step but have the contents of the
> two RDDs saved to the same output text file?
>
> Thank you!
>
> Jerry
>


Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
I am surprised union introduces a stage. UnionRDD should have only narrow
dependencies.
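
One way to check is to inspect the lineage (a sketch; rdd1 and rdd2 stand
for any two RDDs of the same element type):

  // A purely narrow union shows no ShuffledRDD (and no extra stage) here.
  println(rdd1.union(rdd2).toDebugString)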

On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:

> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>
> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> Does anyone know how to union two RDDs without the overhead of the union?
>>
>> Say rdd1.union(rdd2).saveAsTextFile(..)
>> This requires a stage to union the two RDDs before saveAsTextFile (two
>> stages). Is there a way to skip the union step but have the contents of the
>> two RDDs saved to the same output text file?
>>
>> Thank you!
>>
>> Jerry
>>
>
>


Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra
Agree with Koert that UnionRDD should have narrow dependencies, although the
union of two RDDs does increase the number of tasks to be executed
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which launches fewer tasks and hence has less overhead.
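
A minimal sketch of the zipPartitions variant (assumes rdd1 and rdd2 have
the same number of partitions and the same element type; the output path is
illustrative):

  val combined = rdd1.zipPartitions(rdd2) { (left, right) =>
    left ++ right  // concatenate partition i of rdd1 with partition i of rdd2
  }
  combined.saveAsTextFile("a/zip-output")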

On Wed, Feb 3, 2016 at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:

> I am surprised union introduces a stage. UnionRDD should have only narrow
> dependencies.
>
> On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>>
>> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> Does anyone know how to union two RDDs without the overhead of the union?
>>>
>>> Say rdd1.union(rdd2).saveAsTextFile(..)
>>> This requires a stage to union the two RDDs before saveAsTextFile (two
>>> stages). Is there a way to skip the union step but have the contents of the
>>> two RDDs saved to the same output text file?
>>>
>>> Thank you!
>>>
>>> Jerry
>>>
>>
>>
>


-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

