Re: [Spark] Working with JavaPairRDD from Scala

2017-07-22 Thread Lukasz Tracewski
Hi - and my thanks to you and Gerard. Only late hour in the night can explain 
how I could possibly miss this.

Cheers!
Lukasz

On 22/07/2017 10:48, yohann jardin wrote:

Hello Lukasz,


You can just:

val pairRdd = javapairrdd.rdd();


Then pairRdd will be of type RDD[(K, V)], with K being
com.vividsolutions.jts.geom.Polygon and V being
java.util.HashSet[com.vividsolutions.jts.geom.Polygon].



If you really want to continue with Java objects:

val calculateIntersection = new Function2[Polygon, HashSet[Polygon], (Polygon,
scala.collection.mutable.Set[Double])]() {}

and override the call function in the curly braces.


Another solution would be to use a lambda (I do not code much in Scala and I'm
definitely not sure this works, but I expect it to, so you'd have to test it):

javapairrdd.map((polygon: Polygon, hash: HashSet[Polygon]) => (polygon,
hash.asScala.map(polygon.intersection(_).getArea)))


From: Lukasz Tracewski <lukasz.tracew...@outlook.com>
Sent: Saturday, 22 July 2017 00:18
To: user@spark.apache.org
Subject: [Spark] Working with JavaPairRDD from Scala


Hi,

I would like to call a method on JavaPairRDD from Scala and I am not sure how 
to write a function for the "map". I am using a third-party library that uses 
Spark for geospatial computations and it happens that it returns some results 
through Java API. I'd welcome a hint how to write a function for 'map' such 
that JavaPairRDD is happy.

Here's a signature:
org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]]
 = org.apache.spark.api.java.JavaPairRDD

Normally I would write something like this:

def calculate_intersection(polygon: Polygon, hashSet: HashSet[Polygon]) = {
  (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
}

javapairrdd.map(calculate_intersection)


... but it will complain that it's not a Java Function.

My first thought was to implement the interface, i.e.:


class PairRDDWrapper extends
  org.apache.spark.api.java.function.Function2[Polygon, HashSet[Polygon],
    (Polygon, scala.collection.mutable.Set[Double])] {

  override def call(polygon: Polygon, hashSet: HashSet[Polygon]): (Polygon,
      scala.collection.mutable.Set[Double]) = {
    (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
  }
}




I am not sure though how to use it, or if it makes any sense in the first
place. It should be simple; it's just that my Java / Scala is a little rusty.


Cheers,
Lucas



RE: [Spark] Working with JavaPairRDD from Scala

2017-07-22 Thread yohann jardin
Hello Lukasz,


You can just:

val pairRdd = javapairrdd.rdd();


Then pairRdd will be of type RDD[(K, V)], with K being
com.vividsolutions.jts.geom.Polygon and V being
java.util.HashSet[com.vividsolutions.jts.geom.Polygon].



If you really want to continue with Java objects:

val calculateIntersection = new Function2[Polygon, HashSet[Polygon], (Polygon,
scala.collection.mutable.Set[Double])]() {}

and override the call function in the curly braces.


Another solution would be to use a lambda (I do not code much in Scala and I'm
definitely not sure this works, but I expect it to, so you'd have to test it):

javapairrdd.map((polygon: Polygon, hash: HashSet[Polygon]) => (polygon,
hash.asScala.map(polygon.intersection(_).getArea)))


From: Lukasz Tracewski
Sent: Saturday, 22 July 2017 00:18
To: user@spark.apache.org
Subject: [Spark] Working with JavaPairRDD from Scala


Hi,

I would like to call a method on JavaPairRDD from Scala and I am not sure how 
to write a function for the "map". I am using a third-party library that uses 
Spark for geospatial computations and it happens that it returns some results 
through Java API. I'd welcome a hint how to write a function for 'map' such 
that JavaPairRDD is happy.

Here's a signature:
org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]]
 = org.apache.spark.api.java.JavaPairRDD

Normally I would write something like this:

def calculate_intersection(polygon: Polygon, hashSet: HashSet[Polygon]) = {
  (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
}

javapairrdd.map(calculate_intersection)


... but it will complain that it's not a Java Function.

My first thought was to implement the interface, i.e.:


class PairRDDWrapper extends
  org.apache.spark.api.java.function.Function2[Polygon, HashSet[Polygon],
    (Polygon, scala.collection.mutable.Set[Double])] {

  override def call(polygon: Polygon, hashSet: HashSet[Polygon]): (Polygon,
      scala.collection.mutable.Set[Double]) = {
    (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
  }
}




I am not sure though how to use it, or if it makes any sense in the first
place. It should be simple; it's just that my Java / Scala is a little rusty.


Cheers,
Lucas


[Spark] Working with JavaPairRDD from Scala

2017-07-21 Thread Lukasz Tracewski
Hi,

I would like to call a method on JavaPairRDD from Scala and I am not sure how 
to write a function for the "map". I am using a third-party library that uses 
Spark for geospatial computations and it happens that it returns some results 
through Java API. I'd welcome a hint how to write a function for 'map' such 
that JavaPairRDD is happy.

Here's a signature:
org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]]
 = org.apache.spark.api.java.JavaPairRDD

Normally I would write something like this:

def calculate_intersection(polygon: Polygon, hashSet: HashSet[Polygon]) = {
  (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
}

javapairrdd.map(calculate_intersection)


... but it will complain that it's not a Java Function.

My first thought was to implement the interface, i.e.:


class PairRDDWrapper extends
  org.apache.spark.api.java.function.Function2[Polygon, HashSet[Polygon],
    (Polygon, scala.collection.mutable.Set[Double])] {

  override def call(polygon: Polygon, hashSet: HashSet[Polygon]): (Polygon,
      scala.collection.mutable.Set[Double]) = {
    (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
  }
}




I am not sure though how to use it, or if it makes any sense in the first
place. It should be simple; it's just that my Java / Scala is a little rusty.


Cheers,
Lucas


Re: mapValues Transformation (JavaPairRDD)

2015-12-15 Thread Sushrut Ikhar
Well, the issue was that I was using some non-thread-safe functions for
generating the key.

Regards,

Sushrut Ikhar
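For anyone landing here with the same symptom: mapValues itself never touches the
keys, so a drop in the distinct-key count points at non-deterministic key
generation (as above) rather than at the transformation. A minimal, hypothetical
sketch of that invariant (the local master and toy pairs are assumptions, not the
original job):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class MapValuesKeepsKeys {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("mapValues-keys").setMaster("local[*]"));

        // Hypothetical pair RDD; the keys here are generated deterministically.
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

        long distinctBefore = pairs.keys().distinct().count();              // 2
        JavaPairRDD<String, Integer> doubled = pairs.mapValues(v -> v * 2); // keys untouched
        long distinctAfter = doubled.keys().distinct().count();             // still 2

        System.out.println(distinctBefore + " / " + distinctAfter);
        sc.stop();
    }
}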


On Tue, Dec 15, 2015 at 2:27 PM, Paweł Szulc  wrote:

> Hard to imagine. Can you share a code sample?
>
> On Tue, Dec 15, 2015 at 8:06 AM, Sushrut Ikhar 
> wrote:
>
>> Hi,
>> I am finding it difficult to understand the following problem :
>> I count the number of records before and after applying the mapValues
>> transformation for a JavaPairRDD. As expected the number of records were
>> same before and after.
>>
>> Now, I counted number of distinct keys before and after applying the
>> mapValues transformation for the same JavaPairRDD. However, I get less
>> count after applying the transformation. I expected mapValues will not
>> change the keys. Then why am I getting lesser distinct keys? Note that -
>> the total records are the same only distinct keys have dropped.
>>
>> using spark-1.4.1.
>>
>> Thanks in advance.
>>
>> Regards,
>>
>> Sushrut Ikhar
>>
>>
>
>
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com
>


Re: mapValues Transformation (JavaPairRDD)

2015-12-15 Thread Paweł Szulc
Hard to imagine. Can you share a code sample?

On Tue, Dec 15, 2015 at 8:06 AM, Sushrut Ikhar 
wrote:

> Hi,
> I am finding it difficult to understand the following problem :
> I count the number of records before and after applying the mapValues
> transformation for a JavaPairRDD. As expected the number of records were
> same before and after.
>
> Now, I counted number of distinct keys before and after applying the
> mapValues transformation for the same JavaPairRDD. However, I get less
> count after applying the transformation. I expected mapValues will not
> change the keys. Then why am I getting lesser distinct keys? Note that -
> the total records are the same only distinct keys have dropped.
>
> using spark-1.4.1.
>
> Thanks in advance.
>
> Regards,
>
> Sushrut Ikhar
>
>



-- 
Regards,
Paul Szulc

twitter: @rabbitonweb
blog: www.rabbitonweb.com


mapValues Transformation (JavaPairRDD)

2015-12-14 Thread Sushrut Ikhar
Hi,
I am finding it difficult to understand the following problem :
I count the number of records before and after applying the mapValues
transformation for a JavaPairRDD. As expected the number of records were
same before and after.

Now, I counted the number of distinct keys before and after applying the
mapValues transformation for the same JavaPairRDD. However, I get a lower
count after applying the transformation. I expected that mapValues would not
change the keys, so why am I getting fewer distinct keys? Note that the total
record count is the same; only the number of distinct keys has dropped.

using spark-1.4.1.

Thanks in advance.

Regards,

Sushrut Ikhar


Re: Running FPGrowth over a JavaPairRDD?

2015-10-29 Thread Sabarish Sasidharan
Hi

You cannot use PairRDD but you can use JavaRDD. So in your case, to
make it work with least change, you would call run(transactions.values()).

Each MLLib implementation has its own data structure typically and you
would have to convert from your data structure before you invoke. For ex if
you were doing regression on transactions you would instead convert that to
an RDD of LabeledPoint using a transactions.map(). If you wanted clustering
you would convert that to an RDD of Vector.

And taking a step back, without knowing what you want to accomplish: what
your FP-growth snippet will tell you is which sensor values occur together
most frequently. That may or may not be what you are looking for.

Regards
Sab
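A self-contained, hedged sketch of that suggestion (the Long key / List<Double>
basket shape is an assumption standing in for the [timestamp, sensor1, sensor2]
rows from the question, and FP-growth requires the items inside one basket to be
distinct):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;
import scala.Tuple2;

public class FPGrowthOnPairValues {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("fpgrowth-pairs").setMaster("local[*]"));

        // Hypothetical stand-in for the asker's (id, [timestamp, sensor1, sensor2]) pairs.
        JavaPairRDD<Long, List<Double>> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(0L, Arrays.asList(1.0, 20.5, 30.1)),
                new Tuple2<>(1L, Arrays.asList(2.0, 20.5, 30.1))));

        // FPGrowth.run wants a JavaRDD of item collections, so drop the keys.
        JavaRDD<List<Double>> baskets = pairs.values();

        FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
        FPGrowthModel<Double> model = fpg.run(baskets);

        for (FPGrowth.FreqItemset<Double> itemset : model.freqItemsets().toJavaRDD().collect()) {
            System.out.println(itemset.javaItems() + " -> " + itemset.freq());
        }
        sc.stop();
    }
}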
On 30-Oct-2015 3:00 am, "Fernando Paladini"  wrote:

> Hello guys!
>
> First of all, if you want to take a look in a more readable question, take
> a look in my StackOverflow question
> <http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object>
> (I've made the same question there).
>
> I want to test Spark machine learning algorithms and I have some questions
> on how to run these algorithms with non-native data types. I'm going to run
> FPGrowth algorithm over the input because I want to get the most frequent
> itemsets for this input.
>
> *My data is disposed as the following:*
>
> [timestamp, sensor1value, sensor2value] # id: 0[timestamp, sensor1value, 
> sensor2value] # id: 1[timestamp, sensor1value, sensor2value] # id: 
> 2[timestamp, sensor1value, sensor2value] # id: 3...
>
> As I need to use Java (because Python doesn't have a lot of machine
> learning algorithms from Spark), this data structure isn't very easy to
> handle / create.
>
> *To achieve this data structure in Java I can visualize two approaches:*
>
>1. Use existing Java classes and data types to structure the input (I
>think some problems can occur in Spark depending on how complex is my 
> data).
>2. Create my own class (don't know if it works with Spark algorithms)
>
> 1. Existing Java classes and data types
>
> In order to do that I've created a* List>>*, so
> I can keep my data structured and also can create a RDD:
>
> List>> algorithm_data = new ArrayList List>>();
> populate(algorithm_data);JavaPairRDD> transactions = 
> sc.parallelizePairs(algorithm_data);
>
> I don't feel okay with JavaPairRDD because FPGrowth algorithm seems to be not 
> available for this data structure, as I will show you later in this post.
>
> 2. Create my own class
>
> I could also create a new class to store the input properly:
>
> public class PointValue {
>
> private long timestamp;
> private double sensorMeasure1;
> private double sensorMeasure2;
>
> // Constructor, getters and setters omitted...
> }
>
> However, I don't know if I can do that and still use it with Spark
> algorithms without any problems (in other words, running Spark algorithms
> without headaches). I'll focus in the first approach, but if you see that
> the second one is easier to achieve, please tell me.
> The solution (based on approach #1):
>
> // Initializing SparkSparkConf conf = new SparkConf().setAppName("FP-growth 
> Example");JavaSparkContext sc = new JavaSparkContext(conf);
> // Getting data for ML algorithmList>> 
> algorithm_data = new ArrayList>>();
> populate(algorithm_data);JavaPairRDD> transactions = 
> sc.parallelizePairs(algorithm_data);
> // Running FPGrowthFPGrowth fpg = new 
> FPGrowth().setMinSupport(0.2).setNumPartitions(10);FPGrowthModel List>> model = fpg.run(transactions);
> // Printing everythingfor (FPGrowth.FreqItemset>> 
> itemset: model.freqItemsets().toJavaRDD().collect()) {
> System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());}
>
> But then I got:
>
> *The method run(JavaRDD) in the type FPGrowth is not applicable for 
> the arguments (JavaPairRDD>)*
>
> *What can I do in order to solve my problem (run FPGrowth over
> JavaPairRDD)?*
>
> I'm available to give you more information, just tell me exactly what you
> need.
> Thank you!
> Fernando Paladini
>


Running FPGrowth over a JavaPairRDD?

2015-10-29 Thread Fernando Paladini
Hello guys!

First of all, if you want to take a look in a more readable question, take
a look in my StackOverflow question
<http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object>
(I've made the same question there).

I want to test Spark machine learning algorithms and I have some questions
on how to run these algorithms with non-native data types. I'm going to run
FPGrowth algorithm over the input because I want to get the most frequent
itemsets for this input.

*My data is disposed as the following:*

[timestamp, sensor1value, sensor2value]  # id: 0
[timestamp, sensor1value, sensor2value]  # id: 1
[timestamp, sensor1value, sensor2value]  # id: 2
[timestamp, sensor1value, sensor2value]  # id: 3
...

As I need to use Java (because Python doesn't have a lot of machine
learning algorithms from Spark), this data structure isn't very easy to
handle / create.

*To achieve this data structure in Java I can visualize two approaches:*

   1. Use existing Java classes and data types to structure the input (I
   think some problems can occur in Spark depending on how complex is my data).
   2. Create my own class (don't know if it works with Spark algorithms)

1. Existing Java classes and data types

In order to do that I've created a* List>>*, so I
can keep my data structured and also can create a RDD:

List>> algorithm_data = new ArrayList>>();
populate(algorithm_data);
JavaPairRDD> transactions = sc.parallelizePairs(algorithm_data);

I don't feel okay with JavaPairRDD because FPGrowth algorithm seems to
be not available for this data structure, as I will show you later in
this post.

2. Create my own class

I could also create a new class to store the input properly:

public class PointValue {

private long timestamp;
private double sensorMeasure1;
private double sensorMeasure2;

// Constructor, getters and setters omitted...
}

However, I don't know if I can do that and still use it with Spark
algorithms without any problems (in other words, running Spark algorithms
without headaches). I'll focus in the first approach, but if you see that
the second one is easier to achieve, please tell me.
The solution (based on approach #1):

// Initializing Spark
SparkConf conf = new SparkConf().setAppName("FP-growth Example");
JavaSparkContext sc = new JavaSparkContext(conf);

// Getting data for ML algorithm
List>> algorithm_data = new ArrayList>>();
populate(algorithm_data);
JavaPairRDD> transactions = sc.parallelizePairs(algorithm_data);

// Running FPGrowth
FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
FPGrowthModel>> model = fpg.run(transactions);

// Printing everything
for (FPGrowth.FreqItemset>> itemset : model.freqItemsets().toJavaRDD().collect()) {
    System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
}

But then I got:

*The method run(JavaRDD) in the type FPGrowth is not
applicable for the arguments (JavaPairRDD>)*

*What can I do in order to solve my problem (run FPGrowth over
JavaPairRDD)?*

I'm available to give you more information, just tell me exactly what you
need.
Thank you!
Fernando Paladini


Re: How to write mapreduce programming in spark by using java on user-defined javaPairRDD?

2015-07-07 Thread Feynman Liang
Hi Missie,

In the Java API, you should consider:

   1. RDD.map
      <https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#map(scala.Function1,%20scala.reflect.ClassTag)>
      to transform the text
   2. RDD.sortBy
      <https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#sortBy(scala.Function1,%20boolean,%20int,%20scala.math.Ordering,%20scala.reflect.ClassTag)>
      to order by LongWritable
   3. RDD.saveAsTextFile
      <https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#saveAsTextFile(java.lang.String)>
      to write to HDFS
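Since the question uses the Java API, here is a hedged sketch of that pipeline
with the pair-RDD equivalents (mapToPair / sortByKey / saveAsTextFile). sc,
confHadoop, DataInputFormat and the input path come from the question; the
upper-casing transform and the output path are placeholders. The Writables are
converted to plain Long/String before the shuffle because Hadoop Writables are
not java.io.Serializable:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

JavaPairRDD<Long, String> transformed = sourceFile.mapToPair(pair ->
        new Tuple2<>(pair._1().get(),                        // plain long key
                     pair._2().toString().toUpperCase()));   // placeholder "map" step

transformed.sortByKey(true)                                  // order by the LongWritable's value
        .saveAsTextFile("hdfs://cMaster:9000/wcoutput");     // placeholder output path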


On Tue, Jul 7, 2015 at 7:18 AM, 付雅丹  wrote:

> Hi, everyone!
>
> I've got <key, value> pairs in the form of <LongWritable, Text>, where I used
> the following code:
>
> SparkConf conf = new SparkConf().setAppName("MapReduceFileInput");
> JavaSparkContext sc = new JavaSparkContext(conf);
> Configuration confHadoop = new Configuration();
>
> JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
> "hdfs://cMaster:9000/wcinput/data.txt",
> DataInputFormat.class,LongWritable.class,Text.class,confHadoop);
>
> Now I want to handle the javapairrdd data from <LongWritable, Text> to
> another <LongWritable, Text>, where the Text content is different. After
> that, I want to write Text into hdfs in order of LongWritable value. But I
> don't know how to write mapreduce function in spark using java language.
> Someone can help me?
>
>
> Sincerely,
> Missie.
>


How to write mapreduce programming in spark by using java on user-defined javaPairRDD?

2015-07-07 Thread 付雅丹
Hi, everyone!

I've got <key, value> pairs in the form of <LongWritable, Text>, where I used the
following code:

SparkConf conf = new SparkConf().setAppName("MapReduceFileInput");
JavaSparkContext sc = new JavaSparkContext(conf);
Configuration confHadoop = new Configuration();

JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
"hdfs://cMaster:9000/wcinput/data.txt",
DataInputFormat.class,LongWritable.class,Text.class,confHadoop);

Now I want to handle the javapairrdd data from <LongWritable, Text> to
another <LongWritable, Text>, where the Text content is different. After
that, I want to write Text into hdfs in order of LongWritable value. But I
don't know how to write mapreduce function in spark using java language.
Someone can help me?


Sincerely,
Missie.


Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
Thanks, will try this out and get back...

On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das  wrote:

> Try adding the provided scopes:
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.4.0</version>
>   <scope>provided</scope>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.4.0</version>
>   <scope>provided</scope>
> </dependency>
>
> This prevents these artifacts from being included in the assembly JARs.
>
> See scope
>
> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
>
> On Mon, Jun 22, 2015 at 10:28 AM, Nipun Arora 
> wrote:
>
>> Hi Tathagata,
>>
>> I am attaching a snapshot of my pom.xml. It would help immensely, if I
>> can include max, and min values in my mapper phase.
>>
>> The question is still open at :
>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>>
>> I see that there is a bug report filed for a similar error as well:
>> https://issues.apache.org/jira/browse/SPARK-3266
>>
>> Please let me know, how I can get the same version of spark streaming in
>> my assembly.
>> I am using the following spark version:
>> http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
>> .. no compilation, just an untar and use the spark-submit script in a local
>> install.
>>
>>
>> I still get the same error.
>>
>> Exception in thread "JobGenerator" java.lang.NoSuchMethodError: 
>> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-core_2.10</artifactId>
>>   <version>1.4.0</version>
>> </dependency>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-streaming_2.10</artifactId>
>>   <version>1.4.0</version>
>> </dependency>
>>
>> Thanks
>>
>> Nipun
>>
>>
>> On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora 
>> wrote:
>>
>>> Hi Tathagata,
>>>
>>> When you say please mark spark-core and spark-streaming as dependencies
>>> how do you mean?
>>> I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark
>>> downloads. In my maven pom.xml, I am using version 1.4 as described.
>>>
>>> Please let me know how I can fix that?
>>>
>>> Thanks
>>> Nipun
>>>
>>> On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das 
>>> wrote:
>>>
>>>> I think you may be including a different version of Spark Streaming in
>>>> your assembly. Please mark spark-core nd spark-streaming as provided
>>>> dependencies. Any installation of Spark will automatically provide Spark in
>>>> the classpath so you do not have to bundle it.
>>>>
>>>> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have the following piece of code, where I am trying to transform a
>>>>> spark stream and add min and max to it of eachRDD. However, I get an error
>>>>> saying max call does not exist, at run-time (compiles properly). I am 
>>>>> using
>>>>> spark-1.4
>>>>>
>>>>> I have added the question to stackoverflow as well:
>>>>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>>>>>
>>>>> Any help is greatly appreciated :)
>>>>>
>>>>> Thanks
>>>>> Nipun
>>>>>
>>>>> JavaPairDStream, Tuple3> 
>>>>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
>>>>>
>>>>> sortedtsStream.foreach(
>>>>> new Function, Tuple3>>>> Long, Long>>, Void>() {
>>>>> @Override
>>>>> public Void call(JavaPairRDD, 
>>>>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception {
>>>>> List, 
>>>>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect();
>>>>> for(Tuple2, 
>>>>> Tuple3> tuple :templist){
>>>>>
>>>>> Date date = new Date(tuple._1._1);
>>>>> int pattern = tuple._1._2;
>>>>> int count = tuple._2._1();
>>>>> Date maxDate = new Date(tuple._2._2());
>

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Tathagata Das
Try adding the provided scopes:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.4.0</version>
  <scope>provided</scope>
</dependency>

This prevents these artifacts from being included in the assembly JARs.

See scope
https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope

On Mon, Jun 22, 2015 at 10:28 AM, Nipun Arora 
wrote:

> Hi Tathagata,
>
> I am attaching a snapshot of my pom.xml. It would help immensely, if I can
> include max, and min values in my mapper phase.
>
> The question is still open at :
> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>
> I see that there is a bug report filed for a similar error as well:
> https://issues.apache.org/jira/browse/SPARK-3266
>
> Please let me know, how I can get the same version of spark streaming in
> my assembly.
> I am using the following spark version:
> http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
> .. no compilation, just an untar and use the spark-submit script in a local
> install.
>
>
> I still get the same error.
>
> Exception in thread "JobGenerator" java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.4.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.10</artifactId>
>   <version>1.4.0</version>
> </dependency>
>
> Thanks
>
> Nipun
>
>
> On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora 
> wrote:
>
>> Hi Tathagata,
>>
>> When you say please mark spark-core and spark-streaming as dependencies
>> how do you mean?
>> I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark
>> downloads. In my maven pom.xml, I am using version 1.4 as described.
>>
>> Please let me know how I can fix that?
>>
>> Thanks
>> Nipun
>>
>> On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das 
>> wrote:
>>
>>> I think you may be including a different version of Spark Streaming in
>>> your assembly. Please mark spark-core nd spark-streaming as provided
>>> dependencies. Any installation of Spark will automatically provide Spark in
>>> the classpath so you do not have to bundle it.
>>>
>>> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have the following piece of code, where I am trying to transform a
>>>> spark stream and add min and max to it of eachRDD. However, I get an error
>>>> saying max call does not exist, at run-time (compiles properly). I am using
>>>> spark-1.4
>>>>
>>>> I have added the question to stackoverflow as well:
>>>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>>>>
>>>> Any help is greatly appreciated :)
>>>>
>>>> Thanks
>>>> Nipun
>>>>
>>>> JavaPairDStream, Tuple3> 
>>>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
>>>>
>>>> sortedtsStream.foreach(
>>>> new Function, Tuple3>>> Long, Long>>, Void>() {
>>>> @Override
>>>> public Void call(JavaPairRDD, 
>>>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception {
>>>> List, 
>>>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect();
>>>> for(Tuple2, 
>>>> Tuple3> tuple :templist){
>>>>
>>>> Date date = new Date(tuple._1._1);
>>>> int pattern = tuple._1._2;
>>>> int count = tuple._2._1();
>>>> Date maxDate = new Date(tuple._2._2());
>>>> Date minDate = new Date(tuple._2._2());
>>>> System.out.println("TimeSlot: " + date.toString() + " 
>>>> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() 
>>>> + " Min: " + minDate.toString());
>>>>
>>>> }
>>>> return null;
>>>> }
>>>> }
>>>> );
>>>>
>>>> Error:
>>>>
>>>>
>>>> 15/06/18 11:05:06 INF

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Nipun Arora
Hi Tathagata,

I am attaching a snapshot of my pom.xml. It would help immensely, if I can
include max, and min values in my mapper phase.

The question is still open at :
http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796

I see that there is a bug report filed for a similar error as well:
https://issues.apache.org/jira/browse/SPARK-3266

Please let me know, how I can get the same version of spark streaming in my
assembly.
I am using the following spark version:
http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
.. no compilation, just an untar and use the spark-submit script in a local
install.


I still get the same error.

Exception in thread "JobGenerator" java.lang.NoSuchMethodError:
org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;


 
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.4.0</version>
</dependency>

Thanks

Nipun


On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora 
wrote:

> Hi Tathagata,
>
> When you say please mark spark-core and spark-streaming as dependencies
> how do you mean?
> I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark
> downloads. In my maven pom.xml, I am using version 1.4 as described.
>
> Please let me know how I can fix that?
>
> Thanks
> Nipun
>
> On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das 
> wrote:
>
>> I think you may be including a different version of Spark Streaming in
>> your assembly. Please mark spark-core nd spark-streaming as provided
>> dependencies. Any installation of Spark will automatically provide Spark in
>> the classpath so you do not have to bundle it.
>>
>> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora 
>> wrote:
>>
>>> Hi,
>>>
>>> I have the following piece of code, where I am trying to transform a
>>> spark stream and add min and max to it of eachRDD. However, I get an error
>>> saying max call does not exist, at run-time (compiles properly). I am using
>>> spark-1.4
>>>
>>> I have added the question to stackoverflow as well:
>>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>>>
>>> Any help is greatly appreciated :)
>>>
>>> Thanks
>>> Nipun
>>>
>>> JavaPairDStream, Tuple3> 
>>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
>>>
>>> sortedtsStream.foreach(
>>> new Function, Tuple3>> Long, Long>>, Void>() {
>>> @Override
>>> public Void call(JavaPairRDD, 
>>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception {
>>> List, 
>>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect();
>>> for(Tuple2, Tuple3> 
>>> tuple :templist){
>>>
>>> Date date = new Date(tuple._1._1);
>>> int pattern = tuple._1._2;
>>> int count = tuple._2._1();
>>> Date maxDate = new Date(tuple._2._2());
>>> Date minDate = new Date(tuple._2._2());
>>> System.out.println("TimeSlot: " + date.toString() + " 
>>> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + 
>>> " Min: " + minDate.toString());
>>>
>>> }
>>> return null;
>>> }
>>> }
>>> );
>>>
>>> Error:
>>>
>>>
>>> 15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in 
>>> memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)15/06/18 11:05:06 
>>> INFO BlockGenerator: Pushed block input-0-1434639906000Exception in thread 
>>> "JobGenerator" java.lang.NoSuchMethodError: 
>>> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
>>> at 
>>> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
>>> at 
>>> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
>>> at 
>>> org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
>>> at 
>>> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
>>> at 
>>> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
>>> at 
>>> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf
>>>
>>>
>>
>


Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi Tathagata,

When you say please mark spark-core and spark-streaming as dependencies how
do you mean?
I have installed the pre-built spark-1.4 for Hadoop 2.6 from spark
downloads. In my maven pom.xml, I am using version 1.4 as described.

Please let me know how I can fix that?

Thanks
Nipun

On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das  wrote:

> I think you may be including a different version of Spark Streaming in
> your assembly. Please mark spark-core nd spark-streaming as provided
> dependencies. Any installation of Spark will automatically provide Spark in
> the classpath so you do not have to bundle it.
>
> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora 
> wrote:
>
>> Hi,
>>
>> I have the following piece of code, where I am trying to transform a
>> spark stream and add min and max to it of eachRDD. However, I get an error
>> saying max call does not exist, at run-time (compiles properly). I am using
>> spark-1.4
>>
>> I have added the question to stackoverflow as well:
>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>>
>> Any help is greatly appreciated :)
>>
>> Thanks
>> Nipun
>>
>> JavaPairDStream, Tuple3> 
>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
>>
>> sortedtsStream.foreach(
>> new Function, Tuple3> Long, Long>>, Void>() {
>> @Override
>> public Void call(JavaPairRDD, 
>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception {
>> List, 
>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect();
>> for(Tuple2, Tuple3> 
>> tuple :templist){
>>
>> Date date = new Date(tuple._1._1);
>> int pattern = tuple._1._2;
>> int count = tuple._2._1();
>> Date maxDate = new Date(tuple._2._2());
>> Date minDate = new Date(tuple._2._2());
>> System.out.println("TimeSlot: " + date.toString() + " 
>> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + 
>> " Min: " + minDate.toString());
>>
>> }
>> return null;
>> }
>> }
>> );
>>
>> Error:
>>
>>
>> 15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in 
>> memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)15/06/18 11:05:06 
>> INFO BlockGenerator: Pushed block input-0-1434639906000Exception in thread 
>> "JobGenerator" java.lang.NoSuchMethodError: 
>> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
>> at 
>> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
>> at 
>> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
>> at 
>> org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
>> at 
>> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
>> at 
>> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
>> at 
>> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf
>>
>>
>


Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Tathagata Das
I think you may be including a different version of Spark Streaming in your
assembly. Please mark spark-core nd spark-streaming as provided
dependencies. Any installation of Spark will automatically provide Spark in
the classpath so you do not have to bundle it.

On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora 
wrote:

> Hi,
>
> I have the following piece of code, where I am trying to transform a spark
> stream and add min and max to it of eachRDD. However, I get an error saying
> max call does not exist, at run-time (compiles properly). I am using
> spark-1.4
>
> I have added the question to stackoverflow as well:
> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>
> Any help is greatly appreciated :)
>
> Thanks
> Nipun
>
> JavaPairDStream, Tuple3> 
> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
>
> sortedtsStream.foreach(
> new Function, Tuple3 Long>>, Void>() {
> @Override
> public Void call(JavaPairRDD, 
> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception {
> List, Tuple3> 
> >templist = tuple2Tuple3JavaPairRDD.collect();
> for(Tuple2, Tuple3> 
> tuple :templist){
>
> Date date = new Date(tuple._1._1);
> int pattern = tuple._1._2;
> int count = tuple._2._1();
> Date maxDate = new Date(tuple._2._2());
> Date minDate = new Date(tuple._2._2());
> System.out.println("TimeSlot: " + date.toString() + " 
> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + " 
> Min: " + minDate.toString());
>
> }
> return null;
> }
> }
> );
>
> Error:
>
>
> 15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in 
> memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)15/06/18 11:05:06 
> INFO BlockGenerator: Pushed block input-0-1434639906000Exception in thread 
> "JobGenerator" java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
> at 
> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
> at 
> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
> at 
> org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
> at 
> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
> at 
> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf
>
>


[Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi,

I have the following piece of code, where I am trying to transform a spark
stream and add min and max to it of eachRDD. However, I get an error saying
max call does not exist, at run-time (compiles properly). I am using
spark-1.4

I have added the question to stackoverflow as well:
http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796

Any help is greatly appreciated :)

Thanks
Nipun

JavaPairDStream<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> sortedtsStream =
        transformedMaxMintsStream.transformToPair(new Sort2());

sortedtsStream.foreach(
        new Function<JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>, Void>() {
            @Override
            public Void call(JavaPairRDD<Tuple2<Long, Integer>,
                    Tuple3<Integer, Long, Long>> tuple2Tuple3JavaPairRDD) throws Exception {
                List<Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>> templist =
                        tuple2Tuple3JavaPairRDD.collect();
                for (Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> tuple : templist) {

                    Date date = new Date(tuple._1._1);
                    int pattern = tuple._1._2;
                    int count = tuple._2._1();
                    Date maxDate = new Date(tuple._2._2());
                    Date minDate = new Date(tuple._2._2());
                    System.out.println("TimeSlot: " + date.toString() + " Pattern: " + pattern
                            + " Count: " + count + " Max: " + maxDate.toString()
                            + " Min: " + minDate.toString());
                }
                return null;
            }
        }
);

Error:


15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)
15/06/18 11:05:06 INFO BlockGenerator: Pushed block input-0-1434639906000
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
    at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
    at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf


Filling Parquet files by values in Value of a JavaPairRDD

2015-06-06 Thread Mohamed Nadjib Mami
Hello Sparkers,

I'm reading data from a CSV  file, applying some transformations and ending
up with an RDD of pairs (String,Iterable<>).

I have already prepared Parquet files. I want now to take the previous
(key,value) RDD and populate the parquet files like follows:
- key holds the name of the Parquet file.
- value holds values to save in the parquet file whose name is the key.

I tried the simplest way that one can think of: creating a DataFrame inside
a 'map' or 'foreach' on the pair RDD, but this gave NullPointerException. I
read and found that this is because of nesting RDDs, which is not allowed.

Any help on how to achieve this in another way?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Filling-Parquet-files-by-values-in-Value-of-a-JavaPairRDD-tp23188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JavaPairRDD

2015-05-13 Thread Yasemin Kaya
Thank you Tristan. It is totally what I am looking for :)


2015-05-14 5:05 GMT+03:00 Tristan Blakers :

> You could use a map() operation, but the easiest way is probably to just
> call values() method on the JavaPairRDD to get a JavaRDD.
>
> See this link:
>
> https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
>
> Tristan
>
>
>
>
>
> On 13 May 2015 at 23:12, Yasemin Kaya  wrote:
>
>> Hi,
>>
>> I want to get  *JavaPairRDD *from the tuple part of 
>> *JavaPairRDD> Tuple2>  .*
>>
>> As an example: (
>> http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551)) in
>> my *JavaPairRDD> *and I want to get
>> *( (46551), (0,1,0,0,0,0,0,0) )*
>>
>> I try to split tuple._2() and create new JavaPairRDD but I can't.
>> How can I get that ?
>>
>> Have a nice day
>> yasemin
>> --
>> hiç ender hiç
>>
>
>


-- 
hiç ender hiç


Re: JavaPairRDD

2015-05-13 Thread Tristan Blakers
You could use a map() operation, but the easiest way is probably to just
call values() method on the JavaPairRDD to get a JavaRDD.

See this link:
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html

Tristan
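A hedged sketch of both steps (the String types and the comma-separated vector
are hypothetical stand-ins for the (url, (0,1,...,46551)) pairs in the question;
sc is an existing JavaSparkContext):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaPairRDD<String, String> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("http://www.koctas.com.tr/reyon/el-aletleri/7", "0,1,0,0,0,0,0,0,46551")));

JavaRDD<String> vectors = pairs.values();   // drop the URL keys

// Re-key on the trailing id so the result is ((46551), (0,1,0,0,0,0,0,0)).
JavaPairRDD<String, String> byId = vectors.mapToPair(v -> {
    int cut = v.lastIndexOf(',');
    return new Tuple2<>(v.substring(cut + 1), v.substring(0, cut));
});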





On 13 May 2015 at 23:12, Yasemin Kaya  wrote:

> Hi,
>
> I want to get  *JavaPairRDD *from the tuple part of 
> *JavaPairRDD Tuple2>  .*
>
> As an example: (
> http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551)) in
> my *JavaPairRDD> *and I want to get
> *( (46551), (0,1,0,0,0,0,0,0) )*
>
> I try to split tuple._2() and create new JavaPairRDD but I can't.
> How can I get that ?
>
> Have a nice day
> yasemin
> --
> hiç ender hiç
>


JavaPairRDD

2015-05-13 Thread Yasemin Kaya
Hi,

I want to get  *JavaPairRDD *from the tuple part of
*JavaPairRDD>  .*

As an example: (
http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551))
in my *JavaPairRDD> *and I want to get
*( (46551), (0,1,0,0,0,0,0,0) )*

I try to split tuple._2() and create new JavaPairRDD but I can't.
How can I get that ?

Have a nice day
yasemin
-- 
hiç ender hiç


Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread Sean Owen
You need to access the underlying RDD with .rdd() and cast that. That
works for me.
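In other words (a hedged sketch against the 1.3.x direct-stream API, reusing the
String/String messages stream from the code below; the println body is only
illustration):

import java.io.IOException;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

messages.foreachRDD(
        new Function<JavaPairRDD<String, String>, Void>() {
            @Override
            public Void call(JavaPairRDD<String, String> rdd) throws IOException {
                // Cast the underlying Scala RDD, not the JavaPairRDD wrapper.
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                for (OffsetRange o : offsetRanges) {
                    System.out.println(o.topic() + " " + o.partition() + ": "
                            + o.fromOffset() + " -> " + o.untilOffset());
                }
                return null;
            }
        });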

On Mon, Apr 20, 2015 at 4:41 AM, RimBerry
 wrote:
> Hi everyone,
>
> i am trying to use the direct approach  in  streaming-kafka-integration
> <http://spark.apache.org/docs/latest/streaming-kafka-integration.html>
> pulling data from kafka as follow
>
> JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
>         jssc,
>         String.class,
>         String.class,
>         StringDecoder.class,
>         StringDecoder.class,
>         kafkaParams,
>         topicsSet);
>
> messages.foreachRDD(
>         new Function<JavaPairRDD<String, String>, Void>() {
>             @Override
>             public Void call(JavaPairRDD<String, String> rdd) throws IOException {
>                 OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
>                 // ...
>                 return null;
>             }
>         }
> );
>
> then i got an error when running it
> *java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot
> be cast to org.apache.spark.streaming.kafka.HasOffsetRanges* at
> "OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();"
>
> i am using the version 1.3.1 if is it a bug in this version ?
>
> Thank you for spending time with me.
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/STREAMING-KAFKA-Direct-Approach-JavaPairRDD-cannot-be-cast-to-HasOffsetRanges-tp22568.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread RimBerry
Hi everyone,

i am trying to use the direct approach  in  streaming-kafka-integration
<http://spark.apache.org/docs/latest/streaming-kafka-integration.html>  
pulling data from kafka as follow

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.foreachRDD(
        new Function<JavaPairRDD<String, String>, Void>() {
            @Override
            public Void call(JavaPairRDD<String, String> rdd) throws IOException {
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
                // ...
                return null;
            }
        }
);

Then I got an error when running it:
*java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot
be cast to org.apache.spark.streaming.kafka.HasOffsetRanges* at
"OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();"

I am using version 1.3.1. Is this a bug in this version?

Thank you for spending time with me.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/STREAMING-KAFKA-Direct-Approach-JavaPairRDD-cannot-be-cast-to-HasOffsetRanges-tp22568.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JavaPairRDD<String, Integer> to JavaPairRDD<String, String> based on key

2014-09-10 Thread Sean Owen
So, each key-value pair gets a new value for the original key? You
want mapValues().
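A hedged sketch (hypothetical String/Integer pairs on an existing
JavaSparkContext sc; mapToPair is shown only for the case where the new value
must actually read the key):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> counts = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("apple", 3), new Tuple2<>("pear", 5)));

// mapValues: new value, original key kept as-is.
JavaPairRDD<String, String> labelled = counts.mapValues(v -> "count=" + v);

// mapToPair: the function sees the whole tuple, so the key can feed into the new value.
JavaPairRDD<String, String> keyed = counts.mapToPair(t ->
        new Tuple2<>(t._1(), t._1() + ":" + t._2()));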

On Wed, Sep 10, 2014 at 2:01 PM, Tom  wrote:
> Is it possible to generate a JavaPairRDD<String, String> from a
> JavaPairRDD<String, Integer>, where I can also use the key values? I have
> looked at for instance mapToPair, but this generates a new K/V pair based on
> the original value, and does not give me information about the key.
>
> I need this in the initialization phase, where I have two RDD's with similar
> keys, but with different types of values. Generating these is computational
> intensive, and if I could use the first list to generate the second, it
> would save me a big map/reduce phase.
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/JavaPairRDD-String-Integer-to-JavaPairRDD-String-String-based-on-key-tp13875.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



JavaPairRDD<String, Integer> to JavaPairRDD<String, String> based on key

2014-09-10 Thread Tom
Is it possible to generate a JavaPairRDD<String, String> from a
JavaPairRDD<String, Integer>, where I can also use the key values? I have
looked at for instance mapToPair, but this generates a new K/V pair based on
the original value, and does not give me information about the key.

I need this in the initialization phase, where I have two RDD's with similar
keys, but with different types of values. Generating these is computational
intensive, and if I could use the first list to generate the second, it
would save me a big map/reduce phase.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JavaPairRDD-String-Integer-to-JavaPairRDD-String-String-based-on-key-tp13875.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org