Re: [Spark] Working with JavaPairRDD from Scala
Hi - and my thanks to you and Gerard. Only the late hour of the night can explain how I could possibly miss this. Cheers!
Lukasz

On 22/07/2017 10:48, yohann jardin wrote:

Hello Lukasz,

You can just:

val pairRdd = javapairrdd.rdd()

Then pairRdd will be of type RDD[Tuple2[K, V]], with K being com.vividsolutions.jts.geom.Polygon and V being java.util.HashSet[com.vividsolutions.jts.geom.Polygon].

If you really want to continue with Java objects:

val calculateIntersection = new Function2[Polygon, HashSet[Polygon], scala.collection.mutable.Set[Double]]() {}

and in the curly braces, override the call function.

Another solution would be to use a lambda (I do not code much in Scala and I'm definitely not sure this works, but I expect it to, so you'd have to test it):

javapairrdd.map((polygon: Polygon, hash: HashSet[Polygon]) => (polygon, hash.asScala.map(polygon.intersection(_).getArea)))

From: Lukasz Tracewski <mailto:lukasz.tracew...@outlook.com>
Sent: Saturday, 22 July 2017 00:18
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: [Spark] Working with JavaPairRDD from Scala

Hi,

I would like to call a method on JavaPairRDD from Scala and I am not sure how to write a function for the "map". I am using a third-party library that uses Spark for geospatial computations, and it happens that it returns some results through the Java API. I'd welcome a hint on how to write a function for 'map' such that JavaPairRDD is happy.

Here's a signature:

org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]] = org.apache.spark.api.java.JavaPairRDD

Normally I would write something like this:

def calculate_intersection(polygon: Polygon, hashSet: HashSet[Polygon]) = {
  (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
}

javapairrdd.map(calculate_intersection)

... but it will complain that it's not a Java Function.

My first thought was to implement the interface, i.e.:

class PairRDDWrapper extends org.apache.spark.api.java.function.Function2[Polygon, HashSet[Polygon]] {
  override def call(polygon: Polygon, hashSet: HashSet[Polygon]): (Polygon, scala.collection.mutable.Set[Double]) = {
    (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
  }
}

I am not sure though how to use it, or if it makes any sense in the first place. It should be simple; it's just that my Java/Scala is a little rusty.

Cheers,
Lucas
RE: [Spark] Working with JavaPairRDD from Scala
Hello Lukasz,

You can just:

val pairRdd = javapairrdd.rdd()

Then pairRdd will be of type RDD[Tuple2[K, V]], with K being com.vividsolutions.jts.geom.Polygon and V being java.util.HashSet[com.vividsolutions.jts.geom.Polygon].

If you really want to continue with Java objects:

val calculateIntersection = new Function2[Polygon, HashSet[Polygon], scala.collection.mutable.Set[Double]]() {}

and in the curly braces, override the call function.

Another solution would be to use a lambda (I do not code much in Scala and I'm definitely not sure this works, but I expect it to, so you'd have to test it):

javapairrdd.map((polygon: Polygon, hash: HashSet[Polygon]) => (polygon, hash.asScala.map(polygon.intersection(_).getArea)))

From: Lukasz Tracewski
Sent: Saturday, 22 July 2017 00:18
To: user@spark.apache.org
Subject: [Spark] Working with JavaPairRDD from Scala

Hi,

I would like to call a method on JavaPairRDD from Scala and I am not sure how to write a function for the "map". I am using a third-party library that uses Spark for geospatial computations, and it happens that it returns some results through the Java API. I'd welcome a hint on how to write a function for 'map' such that JavaPairRDD is happy.

Here's a signature:

org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]] = org.apache.spark.api.java.JavaPairRDD

Normally I would write something like this:

def calculate_intersection(polygon: Polygon, hashSet: HashSet[Polygon]) = {
  (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
}

javapairrdd.map(calculate_intersection)

... but it will complain that it's not a Java Function.

My first thought was to implement the interface, i.e.:

class PairRDDWrapper extends org.apache.spark.api.java.function.Function2[Polygon, HashSet[Polygon]] {
  override def call(polygon: Polygon, hashSet: HashSet[Polygon]): (Polygon, scala.collection.mutable.Set[Double]) = {
    (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
  }
}

I am not sure though how to use it, or if it makes any sense in the first place. It should be simple; it's just that my Java/Scala is a little rusty.

Cheers,
Lucas
[Spark] Working with JavaPairRDD from Scala
Hi,

I would like to call a method on JavaPairRDD from Scala and I am not sure how to write a function for the "map". I am using a third-party library that uses Spark for geospatial computations, and it happens that it returns some results through the Java API. I'd welcome a hint on how to write a function for 'map' such that JavaPairRDD is happy.

Here's a signature:

org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Polygon,java.util.HashSet[com.vividsolutions.jts.geom.Polygon]] = org.apache.spark.api.java.JavaPairRDD

Normally I would write something like this:

def calculate_intersection(polygon: Polygon, hashSet: HashSet[Polygon]) = {
  (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
}

javapairrdd.map(calculate_intersection)

... but it will complain that it's not a Java Function.

My first thought was to implement the interface, i.e.:

class PairRDDWrapper extends org.apache.spark.api.java.function.Function2[Polygon, HashSet[Polygon]] {
  override def call(polygon: Polygon, hashSet: HashSet[Polygon]): (Polygon, scala.collection.mutable.Set[Double]) = {
    (polygon, hashSet.asScala.map(polygon.intersection(_).getArea))
  }
}

I am not sure though how to use it, or if it makes any sense in the first place. It should be simple; it's just that my Java/Scala is a little rusty.

Cheers,
Lucas
Re: mapValues Transformation (JavaPairRDD)
Well, the issue was that I was using some non-thread-safe functions to generate the keys.

Regards,
Sushrut Ikhar
about.me/sushrutikhar <https://about.me/sushrutikhar?promo=email_sig>

On Tue, Dec 15, 2015 at 2:27 PM, Paweł Szulc wrote:

> Hard to imagine. Can you share a code sample?
>
> On Tue, Dec 15, 2015 at 8:06 AM, Sushrut Ikhar wrote:
>
>> Hi,
>> I am finding it difficult to understand the following problem:
>> I count the number of records before and after applying the mapValues
>> transformation for a JavaPairRDD. As expected, the number of records was
>> the same before and after.
>>
>> Now, I counted the number of distinct keys before and after applying the
>> mapValues transformation for the same JavaPairRDD. However, I get a lower
>> count after applying the transformation. I expected that mapValues would not
>> change the keys, so why am I getting fewer distinct keys? Note that the
>> total number of records is the same; only the distinct-key count has dropped.
>>
>> Using spark-1.4.1.
>>
>> Thanks in advance.
>>
>> Regards,
>> Sushrut Ikhar
>> about.me/sushrutikhar
>> <https://about.me/sushrutikhar?promo=email_sig>
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com
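For illustration, a minimal sketch of the kind of pitfall described above. The shared SimpleDateFormat key generator, the class name, and the 'timestamps' RDD are hypothetical, not taken from this thread; the sketch only shows how a non-thread-safe (and therefore non-deterministic) key function can yield different distinct-key counts each time the lineage is recomputed, while mapValues itself never touches keys:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class UnsafeKeyExample {
    // Shared across tasks and NOT thread-safe: concurrent format() calls can corrupt output.
    private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyy-MM-dd");

    public static void run(JavaRDD<Long> timestamps) {
        JavaPairRDD<String, Long> keyed = timestamps.mapToPair(
            new PairFunction<Long, String, Long>() {
                @Override
                public Tuple2<String, Long> call(Long ts) {
                    // The key can come out garbled under concurrency, and differently on each recomputation.
                    return new Tuple2<String, Long>(DAY.format(new Date(ts)), ts);
                }
            });

        // Each action below recomputes 'keyed' from scratch, so a non-deterministic key
        // function can make the two counts disagree even though mapValues only rewrites
        // values and leaves the keys untouched.
        long before = keyed.keys().distinct().count();
        long after = keyed.mapValues(new Function<Long, Long>() {
            @Override
            public Long call(Long v) {
                return v + 1;
            }
        }).keys().distinct().count();

        System.out.println("distinct keys before: " + before + ", after: " + after);
    }
}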
Re: mapValues Transformation (JavaPairRDD)
Hard to imagine. Can you share a code sample? On Tue, Dec 15, 2015 at 8:06 AM, Sushrut Ikhar wrote: > Hi, > I am finding it difficult to understand the following problem : > I count the number of records before and after applying the mapValues > transformation for a JavaPairRDD. As expected the number of records were > same before and after. > > Now, I counted number of distinct keys before and after applying the > mapValues transformation for the same JavaPairRDD. However, I get less > count after applying the transformation. I expected mapValues will not > change the keys. Then why am I getting lesser distinct keys? Note that - > the total records are the same only distinct keys have dropped. > > using spark-1.4.1. > > Thanks in advance. > > Regards, > > Sushrut Ikhar > [image: https://]about.me/sushrutikhar > <https://about.me/sushrutikhar?promo=email_sig> > > -- Regards, Paul Szulc twitter: @rabbitonweb blog: www.rabbitonweb.com
mapValues Transformation (JavaPairRDD)
Hi,
I am finding it difficult to understand the following problem:
I count the number of records before and after applying the mapValues transformation for a JavaPairRDD. As expected, the number of records was the same before and after.

Now, I counted the number of distinct keys before and after applying the mapValues transformation for the same JavaPairRDD. However, I get a lower count after applying the transformation. I expected that mapValues would not change the keys, so why am I getting fewer distinct keys? Note that the total number of records is the same; only the distinct-key count has dropped.

Using spark-1.4.1.

Thanks in advance.

Regards,
Sushrut Ikhar
about.me/sushrutikhar <https://about.me/sushrutikhar?promo=email_sig>
Re: Running FPGrowth over a JavaPairRDD?
Hi You cannot use PairRDD but you can use JavaRDD. So in your case, to make it work with least change, you would call run(transactions.values()). Each MLLib implementation has its own data structure typically and you would have to convert from your data structure before you invoke. For ex if you were doing regression on transactions you would instead convert that to an RDD of LabeledPoint using a transactions.map(). If you wanted clustering you would convert that to an RDD of Vector. And taking a step back, without knowing what you want to accomplish, What your fp growth snippet will tell you is as to which sensor values occur together most frequently. That may or may not be what you are looking for. Regards Sab On 30-Oct-2015 3:00 am, "Fernando Paladini" wrote: > Hello guys! > > First of all, if you want to take a look in a more readable question, take > a look in my StackOverflow question > <http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object> > (I've made the same question there). > > I want to test Spark machine learning algorithms and I have some questions > on how to run these algorithms with non-native data types. I'm going to run > FPGrowth algorithm over the input because I want to get the most frequent > itemsets for this input. > > *My data is disposed as the following:* > > [timestamp, sensor1value, sensor2value] # id: 0[timestamp, sensor1value, > sensor2value] # id: 1[timestamp, sensor1value, sensor2value] # id: > 2[timestamp, sensor1value, sensor2value] # id: 3... > > As I need to use Java (because Python doesn't have a lot of machine > learning algorithms from Spark), this data structure isn't very easy to > handle / create. > > *To achieve this data structure in Java I can visualize two approaches:* > >1. Use existing Java classes and data types to structure the input (I >think some problems can occur in Spark depending on how complex is my > data). >2. Create my own class (don't know if it works with Spark algorithms) > > 1. Existing Java classes and data types > > In order to do that I've created a* List>>*, so > I can keep my data structured and also can create a RDD: > > List>> algorithm_data = new ArrayList List>>(); > populate(algorithm_data);JavaPairRDD> transactions = > sc.parallelizePairs(algorithm_data); > > I don't feel okay with JavaPairRDD because FPGrowth algorithm seems to be not > available for this data structure, as I will show you later in this post. > > 2. Create my own class > > I could also create a new class to store the input properly: > > public class PointValue { > > private long timestamp; > private double sensorMeasure1; > private double sensorMeasure2; > > // Constructor, getters and setters omitted... > } > > However, I don't know if I can do that and still use it with Spark > algorithms without any problems (in other words, running Spark algorithms > without headaches). I'll focus in the first approach, but if you see that > the second one is easier to achieve, please tell me. 
> The solution (based on approach #1): > > // Initializing SparkSparkConf conf = new SparkConf().setAppName("FP-growth > Example");JavaSparkContext sc = new JavaSparkContext(conf); > // Getting data for ML algorithmList>> > algorithm_data = new ArrayList>>(); > populate(algorithm_data);JavaPairRDD> transactions = > sc.parallelizePairs(algorithm_data); > // Running FPGrowthFPGrowth fpg = new > FPGrowth().setMinSupport(0.2).setNumPartitions(10);FPGrowthModel List>> model = fpg.run(transactions); > // Printing everythingfor (FPGrowth.FreqItemset>> > itemset: model.freqItemsets().toJavaRDD().collect()) { > System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());} > > But then I got: > > *The method run(JavaRDD) in the type FPGrowth is not applicable for > the arguments (JavaPairRDD>)* > > *What can I do in order to solve my problem (run FPGrowth over > JavaPairRDD)?* > > I'm available to give you more information, just tell me exactly what you > need. > Thank you! > Fernando Paladini >
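To make the suggestion above concrete, here is a minimal sketch. It assumes the pair RDD holds Long keys (timestamps) and List<Double> values (the sensor readings), which is one reasonable reading of the question; 'sc' and 'algorithm_data' are the variables from the question, and the generics should be adjusted to whatever the real types are.

import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

// The pair RDD from the question.
JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);

// FPGrowth.run expects a JavaRDD of item collections, not a pair RDD,
// so drop the timestamp keys with values() before calling it.
JavaRDD<List<Double>> baskets = transactions.values();

FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
FPGrowthModel<Double> model = fpg.run(baskets);

for (FPGrowth.FreqItemset<Double> itemset : model.freqItemsets().toJavaRDD().collect()) {
    System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
}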
Running FPGrowth over a JavaPairRDD?
Hello guys!

First of all, if you want a more readable version of this question, take a look at my StackOverflow question <http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object> (I've asked the same question there).

I want to test Spark machine learning algorithms and I have some questions on how to run these algorithms with non-native data types. I'm going to run the FPGrowth algorithm over the input because I want to get the most frequent itemsets for this input.

*My data is laid out as follows:*

[timestamp, sensor1value, sensor2value] # id: 0
[timestamp, sensor1value, sensor2value] # id: 1
[timestamp, sensor1value, sensor2value] # id: 2
[timestamp, sensor1value, sensor2value] # id: 3
...

As I need to use Java (because Python doesn't have a lot of the machine learning algorithms from Spark), this data structure isn't very easy to handle / create.

*To achieve this data structure in Java I can see two approaches:*

1. Use existing Java classes and data types to structure the input (I think some problems can occur in Spark depending on how complex my data is).
2. Create my own class (I don't know if it works with Spark algorithms).

1. Existing Java classes and data types

In order to do that I've created a *List<Tuple2<Long, List<Double>>>*, so I can keep my data structured and also create an RDD:

List<Tuple2<Long, List<Double>>> algorithm_data = new ArrayList<Tuple2<Long, List<Double>>>();
populate(algorithm_data);
JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);

I don't feel okay with JavaPairRDD because the FPGrowth algorithm seems not to be available for this data structure, as I will show you later in this post.

2. Create my own class

I could also create a new class to store the input properly:

public class PointValue {

    private long timestamp;
    private double sensorMeasure1;
    private double sensorMeasure2;

    // Constructor, getters and setters omitted...
}

However, I don't know if I can do that and still use it with Spark algorithms without any problems (in other words, running Spark algorithms without headaches). I'll focus on the first approach, but if you see that the second one is easier to achieve, please tell me.

The solution (based on approach #1):

// Initializing Spark
SparkConf conf = new SparkConf().setAppName("FP-growth Example");
JavaSparkContext sc = new JavaSparkContext(conf);

// Getting data for the ML algorithm
List<Tuple2<Long, List<Double>>> algorithm_data = new ArrayList<Tuple2<Long, List<Double>>>();
populate(algorithm_data);
JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);

// Running FPGrowth
FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
FPGrowthModel<List<Double>> model = fpg.run(transactions);

// Printing everything
for (FPGrowth.FreqItemset<List<Double>> itemset : model.freqItemsets().toJavaRDD().collect()) {
    System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
}

But then I got:

*The method run(JavaRDD) in the type FPGrowth is not applicable for the arguments (JavaPairRDD<Long, List<Double>>)*

*What can I do in order to solve my problem (run FPGrowth over a JavaPairRDD)?*

I'm available to give you more information, just tell me exactly what you need.
Thank you!
Fernando Paladini
Re: How to write mapreduce programming in spark by using java on user-defined javaPairRDD?
Hi Missie,

In the Java API, you should consider:

1. RDD.map <https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#map(scala.Function1,%20scala.reflect.ClassTag)> to transform the text
2. RDD.sortBy <https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#sortBy(scala.Function1,%20boolean,%20int,%20scala.math.Ordering,%20scala.reflect.ClassTag)> to order by LongWritable
3. RDD.saveAsTextFile <https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#saveAsTextFile(java.lang.String)> to write to HDFS

On Tue, Jul 7, 2015 at 7:18 AM, 付雅丹 wrote:

> Hi, everyone!
>
> I've got pairs in the form of <LongWritable, Text>, where I used the following code:
>
> SparkConf conf = new SparkConf().setAppName("MapReduceFileInput");
> JavaSparkContext sc = new JavaSparkContext(conf);
> Configuration confHadoop = new Configuration();
>
> JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
>     "hdfs://cMaster:9000/wcinput/data.txt",
>     DataInputFormat.class, LongWritable.class, Text.class, confHadoop);
>
> Now I want to transform the JavaPairRDD data from <LongWritable, Text> to another
> <LongWritable, Text>, where the Text content is different. After that, I want to
> write the Text into HDFS in order of the LongWritable value. But I don't know how
> to write a map/reduce function in Spark using Java. Can someone help me?
>
> Sincerely,
> Missie.
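As a rough, untested sketch of those three steps in Java: 'sourceFile' is the JavaPairRDD<LongWritable, Text> from the question, the output path and the toUpperCase() transformation are placeholders, and the Writable contents are copied into plain Long/String so the sort and shuffle do not depend on Hadoop Writables.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Step 1: transform the text (toUpperCase() stands in for the real transformation).
JavaPairRDD<Long, String> transformed = sourceFile.mapToPair(
    new PairFunction<Tuple2<LongWritable, Text>, Long, String>() {
        @Override
        public Tuple2<Long, String> call(Tuple2<LongWritable, Text> pair) {
            return new Tuple2<Long, String>(pair._1().get(), pair._2().toString().toUpperCase());
        }
    });

// Steps 2 and 3: sort by the original LongWritable value (ascending) and write the text to HDFS.
transformed.sortByKey(true)
           .values()
           .saveAsTextFile("hdfs://cMaster:9000/wcoutput");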
How to write mapreduce programming in spark by using java on user-defined javaPairRDD?
Hi, everyone!

I've got pairs in the form of <LongWritable, Text>, where I used the following code:

SparkConf conf = new SparkConf().setAppName("MapReduceFileInput");
JavaSparkContext sc = new JavaSparkContext(conf);
Configuration confHadoop = new Configuration();

JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",
    DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

Now I want to transform the JavaPairRDD data from <LongWritable, Text> to another <LongWritable, Text>, where the Text content is different. After that, I want to write the Text into HDFS in order of the LongWritable value. But I don't know how to write a map/reduce function in Spark using Java. Can someone help me?

Sincerely,
Missie.
Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD
Thanks, will try this out and get back... On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das wrote: > Try adding the provided scopes > > > org.apache.spark > spark-core_2.10 > 1.4.0 > > *provided * > > org.apache.spark > spark-streaming_2.10 > 1.4.0 > > *provided * > > This prevents these artifacts from being included in the assembly JARs. > > See scope > > https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope > > On Mon, Jun 22, 2015 at 10:28 AM, Nipun Arora > wrote: > >> Hi Tathagata, >> >> I am attaching a snapshot of my pom.xml. It would help immensely, if I >> can include max, and min values in my mapper phase. >> >> The question is still open at : >> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 >> >> I see that there is a bug report filed for a similar error as well: >> https://issues.apache.org/jira/browse/SPARK-3266 >> >> Please let me know, how I can get the same version of spark streaming in >> my assembly. >> I am using the following spark version: >> http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz >> .. no compilation, just an untar and use the spark-submit script in a local >> install. >> >> >> I still get the same error. >> >> Exception in thread "JobGenerator" java.lang.NoSuchMethodError: >> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2; >> >> >> >> org.apache.spark >> spark-core_2.10 >> 1.4.0 >> >> >> org.apache.spark >> spark-streaming_2.10 >> 1.4.0 >> >> >> Thanks >> >> Nipun >> >> >> On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora >> wrote: >> >>> Hi Tathagata, >>> >>> When you say please mark spark-core and spark-streaming as dependencies >>> how do you mean? >>> I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark >>> downloads. In my maven pom.xml, I am using version 1.4 as described. >>> >>> Please let me know how I can fix that? >>> >>> Thanks >>> Nipun >>> >>> On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das >>> wrote: >>> >>>> I think you may be including a different version of Spark Streaming in >>>> your assembly. Please mark spark-core nd spark-streaming as provided >>>> dependencies. Any installation of Spark will automatically provide Spark in >>>> the classpath so you do not have to bundle it. >>>> >>>> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> I have the following piece of code, where I am trying to transform a >>>>> spark stream and add min and max to it of eachRDD. However, I get an error >>>>> saying max call does not exist, at run-time (compiles properly). I am >>>>> using >>>>> spark-1.4 >>>>> >>>>> I have added the question to stackoverflow as well: >>>>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 >>>>> >>>>> Any help is greatly appreciated :) >>>>> >>>>> Thanks >>>>> Nipun >>>>> >>>>> JavaPairDStream, Tuple3> >>>>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2()); >>>>> >>>>> sortedtsStream.foreach( >>>>> new Function, Tuple3>>>> Long, Long>>, Void>() { >>>>> @Override >>>>> public Void call(JavaPairRDD, >>>>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception { >>>>> List, >>>>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect(); >>>>> for(Tuple2, >>>>> Tuple3> tuple :templist){ >>>>> >>>>> Date date = new Date(tuple._1._1); >>>>> int pattern = tuple._1._2; >>>>> int count = tuple._2._1(); >>>>> Date maxDate = new Date(tuple._2._2()); >
Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD
Try adding the provided scope:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.4.0</version>
    <scope>provided</scope>
</dependency>

This prevents these artifacts from being included in the assembly JARs.

See scope:
https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope

On Mon, Jun 22, 2015 at 10:28 AM, Nipun Arora wrote:

> Hi Tathagata,
>
> I am attaching a snapshot of my pom.xml. It would help immensely, if I can
> include max, and min values in my mapper phase.
>
> The question is still open at:
> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
>
> I see that there is a bug report filed for a similar error as well:
> https://issues.apache.org/jira/browse/SPARK-3266
>
> Please let me know how I can get the same version of spark streaming in
> my assembly.
> I am using the following spark version:
> http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
> .. no compilation, just an untar and use the spark-submit script in a local
> install.
>
> I still get the same error.
>
> Exception in thread "JobGenerator" java.lang.NoSuchMethodError:
> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>1.4.0</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.10</artifactId>
>     <version>1.4.0</version>
> </dependency>
>
> Thanks
> Nipun
>
> On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora wrote:
>
>> Hi Tathagata,
>>
>> When you say please mark spark-core and spark-streaming as dependencies,
>> how do you mean?
>> I have installed the pre-built spark-1.4 for Hadoop 2.6 from spark
>> downloads. In my maven pom.xml, I am using version 1.4 as described.
>>
>> Please let me know how I can fix that?
>>
>> Thanks
>> Nipun
>>
>> On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das wrote:
>>
>>> I think you may be including a different version of Spark Streaming in
>>> your assembly. Please mark spark-core and spark-streaming as provided
>>> dependencies. Any installation of Spark will automatically provide Spark in
>>> the classpath so you do not have to bundle it.
>>>
>>> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora wrote:
>>>
>>>> Hi,
>>>>
>>>> I have the following piece of code, where I am trying to transform a
>>>> spark stream and add min and max to it of eachRDD. However, I get an error
>>>> saying max call does not exist, at run-time (compiles properly).
I am using >>>> spark-1.4 >>>> >>>> I have added the question to stackoverflow as well: >>>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 >>>> >>>> Any help is greatly appreciated :) >>>> >>>> Thanks >>>> Nipun >>>> >>>> JavaPairDStream, Tuple3> >>>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2()); >>>> >>>> sortedtsStream.foreach( >>>> new Function, Tuple3>>> Long, Long>>, Void>() { >>>> @Override >>>> public Void call(JavaPairRDD, >>>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception { >>>> List, >>>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect(); >>>> for(Tuple2, >>>> Tuple3> tuple :templist){ >>>> >>>> Date date = new Date(tuple._1._1); >>>> int pattern = tuple._1._2; >>>> int count = tuple._2._1(); >>>> Date maxDate = new Date(tuple._2._2()); >>>> Date minDate = new Date(tuple._2._2()); >>>> System.out.println("TimeSlot: " + date.toString() + " >>>> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() >>>> + " Min: " + minDate.toString()); >>>> >>>> } >>>> return null; >>>> } >>>> } >>>> ); >>>> >>>> Error: >>>> >>>> >>>> 15/06/18 11:05:06 INF
Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD
Hi Tathagata, I am attaching a snapshot of my pom.xml. It would help immensely, if I can include max, and min values in my mapper phase. The question is still open at : http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 I see that there is a bug report filed for a similar error as well: https://issues.apache.org/jira/browse/SPARK-3266 Please let me know, how I can get the same version of spark streaming in my assembly. I am using the following spark version: http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz .. no compilation, just an untar and use the spark-submit script in a local install. I still get the same error. Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2; org.apache.spark spark-core_2.10 1.4.0 org.apache.spark spark-streaming_2.10 1.4.0 Thanks Nipun On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora wrote: > Hi Tathagata, > > When you say please mark spark-core and spark-streaming as dependencies > how do you mean? > I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark > downloads. In my maven pom.xml, I am using version 1.4 as described. > > Please let me know how I can fix that? > > Thanks > Nipun > > On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das > wrote: > >> I think you may be including a different version of Spark Streaming in >> your assembly. Please mark spark-core nd spark-streaming as provided >> dependencies. Any installation of Spark will automatically provide Spark in >> the classpath so you do not have to bundle it. >> >> On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora >> wrote: >> >>> Hi, >>> >>> I have the following piece of code, where I am trying to transform a >>> spark stream and add min and max to it of eachRDD. However, I get an error >>> saying max call does not exist, at run-time (compiles properly). 
I am using >>> spark-1.4 >>> >>> I have added the question to stackoverflow as well: >>> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 >>> >>> Any help is greatly appreciated :) >>> >>> Thanks >>> Nipun >>> >>> JavaPairDStream, Tuple3> >>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2()); >>> >>> sortedtsStream.foreach( >>> new Function, Tuple3>> Long, Long>>, Void>() { >>> @Override >>> public Void call(JavaPairRDD, >>> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception { >>> List, >>> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect(); >>> for(Tuple2, Tuple3> >>> tuple :templist){ >>> >>> Date date = new Date(tuple._1._1); >>> int pattern = tuple._1._2; >>> int count = tuple._2._1(); >>> Date maxDate = new Date(tuple._2._2()); >>> Date minDate = new Date(tuple._2._2()); >>> System.out.println("TimeSlot: " + date.toString() + " >>> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + >>> " Min: " + minDate.toString()); >>> >>> } >>> return null; >>> } >>> } >>> ); >>> >>> Error: >>> >>> >>> 15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in >>> memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)15/06/18 11:05:06 >>> INFO BlockGenerator: Pushed block input-0-1434639906000Exception in thread >>> "JobGenerator" java.lang.NoSuchMethodError: >>> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2; >>> at >>> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346) >>> at >>> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340) >>> at >>> org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360) >>> at >>> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361) >>> at >>> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361) >>> at >>> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf >>> >>> >> >
Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD
Hi Tathagata, When you say please mark spark-core and spark-streaming as dependencies how do you mean? I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark downloads. In my maven pom.xml, I am using version 1.4 as described. Please let me know how I can fix that? Thanks Nipun On Thu, Jun 18, 2015 at 4:22 PM, Tathagata Das wrote: > I think you may be including a different version of Spark Streaming in > your assembly. Please mark spark-core nd spark-streaming as provided > dependencies. Any installation of Spark will automatically provide Spark in > the classpath so you do not have to bundle it. > > On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora > wrote: > >> Hi, >> >> I have the following piece of code, where I am trying to transform a >> spark stream and add min and max to it of eachRDD. However, I get an error >> saying max call does not exist, at run-time (compiles properly). I am using >> spark-1.4 >> >> I have added the question to stackoverflow as well: >> http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 >> >> Any help is greatly appreciated :) >> >> Thanks >> Nipun >> >> JavaPairDStream, Tuple3> >> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2()); >> >> sortedtsStream.foreach( >> new Function, Tuple3> Long, Long>>, Void>() { >> @Override >> public Void call(JavaPairRDD, >> Tuple3> tuple2Tuple3JavaPairRDD) throws Exception { >> List, >> Tuple3> >templist = tuple2Tuple3JavaPairRDD.collect(); >> for(Tuple2, Tuple3> >> tuple :templist){ >> >> Date date = new Date(tuple._1._1); >> int pattern = tuple._1._2; >> int count = tuple._2._1(); >> Date maxDate = new Date(tuple._2._2()); >> Date minDate = new Date(tuple._2._2()); >> System.out.println("TimeSlot: " + date.toString() + " >> Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + >> " Min: " + minDate.toString()); >> >> } >> return null; >> } >> } >> ); >> >> Error: >> >> >> 15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in >> memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)15/06/18 11:05:06 >> INFO BlockGenerator: Pushed block input-0-1434639906000Exception in thread >> "JobGenerator" java.lang.NoSuchMethodError: >> org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2; >> at >> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346) >> at >> org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340) >> at >> org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360) >> at >> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361) >> at >> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361) >> at >> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf >> >> >
Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD
I think you may be including a different version of Spark Streaming in your assembly. Please mark spark-core nd spark-streaming as provided dependencies. Any installation of Spark will automatically provide Spark in the classpath so you do not have to bundle it. On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora wrote: > Hi, > > I have the following piece of code, where I am trying to transform a spark > stream and add min and max to it of eachRDD. However, I get an error saying > max call does not exist, at run-time (compiles properly). I am using > spark-1.4 > > I have added the question to stackoverflow as well: > http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 > > Any help is greatly appreciated :) > > Thanks > Nipun > > JavaPairDStream, Tuple3> > sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2()); > > sortedtsStream.foreach( > new Function, Tuple3 Long>>, Void>() { > @Override > public Void call(JavaPairRDD, > Tuple3> tuple2Tuple3JavaPairRDD) throws Exception { > List, Tuple3> > >templist = tuple2Tuple3JavaPairRDD.collect(); > for(Tuple2, Tuple3> > tuple :templist){ > > Date date = new Date(tuple._1._1); > int pattern = tuple._1._2; > int count = tuple._2._1(); > Date maxDate = new Date(tuple._2._2()); > Date minDate = new Date(tuple._2._2()); > System.out.println("TimeSlot: " + date.toString() + " > Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + " > Min: " + minDate.toString()); > > } > return null; > } > } > ); > > Error: > > > 15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in > memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)15/06/18 11:05:06 > INFO BlockGenerator: Pushed block input-0-1434639906000Exception in thread > "JobGenerator" java.lang.NoSuchMethodError: > org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2; > at > org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346) > at > org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340) > at > org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360) > at > org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361) > at > org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf > >
[Spark Streaming] Runtime Error in call to max function for JavaPairRDD
Hi,

I have the following piece of code, where I am trying to transform a Spark stream and add the min and max of each RDD to it. However, I get an error at run-time saying the max call does not exist (it compiles properly). I am using spark-1.4.

I have added the question to stackoverflow as well:
http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796

Any help is greatly appreciated :)

Thanks
Nipun

JavaPairDStream<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());

sortedtsStream.foreach(
    new Function<JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>, Void>() {
        @Override
        public Void call(JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> tuple2Tuple3JavaPairRDD) throws Exception {
            List<Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>> templist = tuple2Tuple3JavaPairRDD.collect();
            for (Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> tuple : templist) {

                Date date = new Date(tuple._1._1);
                int pattern = tuple._1._2;
                int count = tuple._2._1();
                Date maxDate = new Date(tuple._2._2());
                Date minDate = new Date(tuple._2._2());
                System.out.println("TimeSlot: " + date.toString() + " Pattern: " + pattern + " Count: " + count + " Max: " + maxDate.toString() + " Min: " + minDate.toString());

            }
            return null;
        }
    }
);

Error:

15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)
15/06/18 11:05:06 INFO BlockGenerator: Pushed block input-0-1434639906000
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
at org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonf
Filling Parquet files by values in Value of a JavaPairRDD
Hello Sparkers,

I'm reading data from a CSV file, applying some transformations, and ending up with an RDD of pairs (String, Iterable<>). I have already prepared Parquet files. I now want to take the previous (key, value) RDD and populate the Parquet files as follows:

- the key holds the name of the Parquet file.
- the value holds the values to save in the Parquet file whose name is the key.

I tried the simplest way one can think of: creating a DataFrame inside a 'map' or 'foreach' on the pair RDD, but this gave a NullPointerException. I read and found that this is because of nesting RDDs, which is not allowed.

Any help on how to achieve this in another way?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Filling-Parquet-files-by-values-in-Value-of-a-JavaPairRDD-tp23188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: JavaPairRDD
Thank you Tristan. It is totally what I am looking for :) 2015-05-14 5:05 GMT+03:00 Tristan Blakers : > You could use a map() operation, but the easiest way is probably to just > call values() method on the JavaPairRDD to get a JavaRDD. > > See this link: > > https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html > > Tristan > > > > > > On 13 May 2015 at 23:12, Yasemin Kaya wrote: > >> Hi, >> >> I want to get *JavaPairRDD *from the tuple part of >> *JavaPairRDD> Tuple2> .* >> >> As an example: ( >> http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551)) in >> my *JavaPairRDD> *and I want to get >> *( (46551), (0,1,0,0,0,0,0,0) )* >> >> I try to split tuple._2() and create new JavaPairRDD but I can't. >> How can I get that ? >> >> Have a nice day >> yasemin >> -- >> hiç ender hiç >> > > -- hiç ender hiç
Re: JavaPairRDD
You could use a map() operation, but the easiest way is probably to just call values() method on the JavaPairRDD to get a JavaRDD. See this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html Tristan On 13 May 2015 at 23:12, Yasemin Kaya wrote: > Hi, > > I want to get *JavaPairRDD *from the tuple part of > *JavaPairRDD Tuple2> .* > > As an example: ( > http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551)) in > my *JavaPairRDD> *and I want to get > *( (46551), (0,1,0,0,0,0,0,0) )* > > I try to split tuple._2() and create new JavaPairRDD but I can't. > How can I get that ? > > Have a nice day > yasemin > -- > hiç ender hiç >
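For the re-keying shown in the example ((url, (vector, id)) becoming (id, vector)), a mapToPair along these lines should work. The generic types here (String key, Tuple2<String, Integer> value) and the variable names are assumptions, since the exact types were lost from the archived message:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// 'original' stands for the pair RDD from the question, assumed to be
// JavaPairRDD<String, Tuple2<String, Integer>> with values like ("0,1,0,0,0,0,0,0", 46551).
JavaPairRDD<Integer, String> byId = original.mapToPair(
    new PairFunction<Tuple2<String, Tuple2<String, Integer>>, Integer, String>() {
        @Override
        public Tuple2<Integer, String> call(Tuple2<String, Tuple2<String, Integer>> t) {
            // Drop the URL key and re-key by the id held in the value tuple.
            return new Tuple2<Integer, String>(t._2()._2(), t._2()._1());
        }
    });

// If only the values are needed (as suggested above), values() is enough:
// JavaRDD<Tuple2<String, Integer>> justValues = original.values();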
JavaPairRDD
Hi,

I want to get a *JavaPairRDD* from the tuple part of my *JavaPairRDD*.

As an example: (http://www.koctas.com.tr/reyon/el-aletleri/7, (0,1,0,0,0,0,0,0,46551)) is in my *JavaPairRDD*, and I want to get *( (46551), (0,1,0,0,0,0,0,0) )*.

I tried to split tuple._2() and create a new JavaPairRDD, but I couldn't. How can I get that?

Have a nice day
yasemin
--
hiç ender hiç
Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges
You need to access the underlying RDD with .rdd() and cast that. That works for me. On Mon, Apr 20, 2015 at 4:41 AM, RimBerry wrote: > Hi everyone, > > i am trying to use the direct approach in streaming-kafka-integration > <http://spark.apache.org/docs/latest/streaming-kafka-integration.html> > pulling data from kafka as follow > > JavaPairInputDStream messages = > KafkaUatils.createDirectStream(jssc, > > String.class, > > String.class, > > StringDecoder.class, > > StringDecoder.class, > > kafkaParams, > > topicsSet); > > messages.foreachRDD( > new Function, Void>() { > @Override > public Void call(JavaPairRDD String> rdd) throws IOException { > OffsetRange[] offsetRanges = > ((HasOffsetRanges) rdd).offsetRanges(); > //. > return null; > } > } > ); > > then i got an error when running it > *java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot > be cast to org.apache.spark.streaming.kafka.HasOffsetRanges* at > "OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();" > > i am using the version 1.3.1 if is it a bug in this version ? > > Thank you for spending time with me. > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/STREAMING-KAFKA-Direct-Approach-JavaPairRDD-cannot-be-cast-to-HasOffsetRanges-tp22568.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
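In code, that suggestion looks roughly like this (a sketch against the 1.3/1.4 direct-stream API; only the rdd.rdd() cast differs from the snippet in the question, and 'messages' is the stream defined there):

import java.io.IOException;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

messages.foreachRDD(
    new Function<JavaPairRDD<String, String>, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd) throws IOException {
            // Cast the underlying Scala RDD (a KafkaRDD), not the Java wrapper.
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange o : offsetRanges) {
                System.out.println(o.topic() + " " + o.partition() + " "
                        + o.fromOffset() + " -> " + o.untilOffset());
            }
            return null;
        }
    }
);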
[STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges
Hi everyone,

I am trying to use the direct approach in streaming-kafka-integration <http://spark.apache.org/docs/latest/streaming-kafka-integration.html>, pulling data from Kafka as follows:

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.foreachRDD(
    new Function<JavaPairRDD<String, String>, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd) throws IOException {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
            //.
            return null;
        }
    }
);

Then I got an error when running it:

*java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges* at "OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();"

I am using version 1.3.1. Is it a bug in this version?

Thank you for spending time with me.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/STREAMING-KAFKA-Direct-Approach-JavaPairRDD-cannot-be-cast-to-HasOffsetRanges-tp22568.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: JavaPairRDD to JavaPairRDD based on key
So, each key-value pair gets a new value for the original key? You want mapValues().

On Wed, Sep 10, 2014 at 2:01 PM, Tom wrote:
> Is it possible to generate a JavaPairRDD<String, String> from a
> JavaPairRDD<String, Integer>, where I can also use the key values? I have
> looked at, for instance, mapToPair, but this generates a new K/V pair based on
> the original value, and does not give me information about the key.
>
> I need this in the initialization phase, where I have two RDDs with similar
> keys, but with different types of values. Generating these is computationally
> intensive, and if I could use the first list to generate the second, it
> would save me a big map/reduce phase.
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JavaPairRDD-String-Integer-to-JavaPairRDD-String-String-based-on-key-tp13875.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
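A small sketch of the mapValues route, using the <String, Integer> to <String, String> types from the subject line; the 'counts' variable name and the value transformation are placeholders. If the new value really has to be computed from the key as well, mapToPair over the whole Tuple2 (returning the same key) is the usual alternative, since mapValues only sees the value:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// 'counts' stands for the existing JavaPairRDD<String, Integer>.
JavaPairRDD<String, String> asStrings = counts.mapValues(
    new Function<Integer, String>() {
        @Override
        public String call(Integer v) {
            // Keys are left untouched; only the value is rewritten.
            return String.valueOf(v);  // placeholder transformation
        }
    });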
JavaPairRDD to JavaPairRDD based on key
Is it possible to generate a JavaPairRDD<String, String> from a JavaPairRDD<String, Integer>, where I can also use the key values? I have looked at, for instance, mapToPair, but this generates a new K/V pair based on the original value, and does not give me information about the key.

I need this in the initialization phase, where I have two RDDs with similar keys, but with different types of values. Generating these is computationally intensive, and if I could use the first list to generate the second, it would save me a big map/reduce phase.

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JavaPairRDD-String-Integer-to-JavaPairRDD-String-String-based-on-key-tp13875.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org