Re: [RESULT] [VOTE] Release Apache Spark 1.2.2
Sorry, I couldn't catch up before the voting closed. If it still counts: mvn package fails (1), and I didn't run the tests (2). So, -1.

1. mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package
2. mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test

Error:
[INFO] Spark Project External Flume Sink .. SUCCESS [ 39.561 s]
[INFO] Spark Project External Flume ....... FAILURE [ 11.212 s]
[INFO] Spark Project External MQTT ........ SKIPPED
[INFO] Spark Project External ZeroMQ ...... SKIPPED
[INFO] Spark Project External Kafka ....... SKIPPED
[INFO] Spark Project Examples ............. SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED

Thanking you.
With Regards
Sree

On Thursday, April 16, 2015 3:42 PM, Patrick Wendell pwend...@gmail.com wrote:

I'm gonna go ahead and close this now - thanks everyone for voting! This vote passes with 7 +1 votes (6 binding) and no 0 or -1 votes.

+1:
Mark Hamstra*
Reynold Xin
Krishna Sankar
Sean Owen*
Tom Graves*
Joseph Bradley*
Sean McNamara*

0:

-1:

Thanks!
- Patrick

On Thu, Apr 16, 2015 at 3:27 PM, Sean Owen so...@cloudera.com wrote:

No, of course Jenkins runs tests. The way some of the tests work, they need the build artifacts to have been created first. So it runs mvn ... -DskipTests package and then mvn ... test.

On Thu, Apr 16, 2015 at 11:09 PM, Sree V sree_at_ch...@yahoo.com wrote:

In my effort to vote for this release, I found these along the way. This is from Jenkins; it uses -DskipTests:

[centos] $ /home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5/bin/mvn -Dhadoop.version=2.0.0-mr1-cdh4.1.2 -Dlabel=centos -DskipTests clean package

We build on our local machines / servers using the same flag. Usually, for releases, we also build with all the tests run, and at some level of code coverage. Are we bypassing that?

Thanking you.
With Regards
Sree

On Wednesday, April 15, 2015 3:32 PM, Sean McNamara sean.mcnam...@webtrends.com wrote:

Ran tests on OS X.

+1

Sean

On Apr 14, 2015, at 10:59 PM, Patrick Wendell pwend...@gmail.com wrote:

I'd like to close this vote to coincide with the 1.3.1 release; however, it would be great to have more people test this release first. I'll leave it open for a bit longer and see if others can give a +1.

On Tue, Apr 14, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote:

+1 from me as well.

On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen so...@cloudera.com wrote:

I think that's close enough for a +1: signatures and hashes are good; LICENSE and NOTICE still check out; it compiles for a Hadoop 2.6 + YARN + Hive profile. JIRAs with target version = 1.2.x look legitimate; no blockers. I still observe several Hive test failures with:

mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test

.. though again I think these are not regressions but known issues in older branches.
FYI there are 16 Critical issues still open for 1.2.x:

SPARK-6209,ExecutorClassLoader can leak connections after failing to load classes from the REPL class server,Josh Rosen,In Progress,4/5/15
SPARK-5098,Number of running tasks become negative after tasks lost,,Open,1/14/15
SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge instances,,Open,1/27/15
SPARK-4879,Missing output partitions after job completes with speculative execution,Josh Rosen,Open,3/5/15
SPARK-4568,Publish release candidates under $VERSION-RCX instead of $VERSION,Patrick Wendell,Open,11/24/14
SPARK-4520,SparkSQL exception when reading certain columns from a parquet file,sadhan sood,Open,1/21/15
SPARK-4514,SparkContext localProperties does not inherit property updates across thread reuse,Josh Rosen,Open,3/31/15
SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
SPARK-4452,Shuffle data structures can starve others on the same thread for memory,Tianshuo Deng,Open,1/24/15
SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14
SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15
SPARK-4194,Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state,,In Progress,4/2/15
SPARK-4159,Maven build doesn't run JUnit test suites,Sean Owen,Open,1/11/15
SPARK-4106,Shuffle write and spill to disk metrics are incorrect,,Open,10/28/14
SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14
SPARK-3461,Support external groupByKey using repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14
SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14
SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15
SPARK-1312,Batch should read based on the batch interval provided in the
Re: [RESULT] [VOTE] Release Apache Spark 1.2.2
Sree, that doesn't show any error, so it doesn't help. I built with the same flags when I tested and it succeeded.

On Fri, Apr 17, 2015 at 8:53 AM, Sree V sree_at_ch...@yahoo.com.invalid wrote: ...
Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?
Hi,
I did some tests on Parquet files with the Spark SQL DataFrame API. I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on Tachyon; the size of each file is about 222M. Then I read them with the code below:

val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")

Next, I simply save this DataFrame onto HDFS with the code below. It also generates 36 Parquet files, but the size of each file is about 265M:

tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon")

My question is: why do the files on HDFS have a different size from those on Tachyon, even though they come from the same original data?

Thanks
Zhang Xiongfei
Re: Gitter chat room for Spark
There are N chat options out there, and of course there's no need or way to stop people from using them. If one is blessed as 'best', it excludes others who prefer a different one. Tomorrow there will be a New Best Chat App. If a bunch are blessed, the conversation fractures. There's also a principle that important-ish discussions should take place on official project discussion forums, i.e., the mailing lists. Chat is just chat, but sometimes discussions appropriate to the list happen there. For this reason I've always thought it best to punt on official-ish chat and let it happen organically as it will, with a request to port any important discussions to the list.

On Fri, Apr 17, 2015 at 7:41 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

Freenode already has a fairly active channel under #Apache-spark; I think Josh idles there sometimes.

Thanks
Best Regards

On Fri, Apr 17, 2015 at 3:33 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Would we be interested in having a public chat room? Gitter http://gitter.im offers them for free for open source projects. It's like web-based IRC. Check out the Docker room for example: https://gitter.im/docker/docker And if people prefer to use actual IRC, Gitter offers a bridge for that https://irc.gitter.im/ to their service. All we need is someone who's a member of the Apache GitHub group to create a room for us. It should show up under https://gitter.im/apache/spark when it's ready. What do y'all think?

Nick
Addition of new Metrics for killed executors.
Hi,
We are planning to add new metrics in Spark for the executors that got killed during execution. I was just curious why this info is not already present. Is there some reason for not adding it? Any ideas are welcome.

Thanks and Regards,
Archit Thakur.
[Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
Hi everyone,
I had an issue trying to use Spark SQL from Java (8 or 7). I tried to reproduce it in a small test case close to the actual documentation https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, so sorry for the long mail, but this is Java:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Movie implements Serializable {
    private int id;
    private String name;

    public Movie(int id, String name) {
        this.id = id;
        this.name = name;
    }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

public class SparkSQLTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("My Application");
        conf.setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        ArrayList<Movie> movieArrayList = new ArrayList<Movie>();
        movieArrayList.add(new Movie(1, "Indiana Jones"));
        JavaRDD<Movie> movies = sc.parallelize(movieArrayList);

        SQLContext sqlContext = new SQLContext(sc);
        DataFrame frame = sqlContext.applySchema(movies, Movie.class);
        frame.registerTempTable("movies");

        sqlContext.sql("select name from movies")
            .map(row -> row.getString(0)) // this is what I would expect to work
            .collect();
    }
}

But this does not compile; here's the compilation error:

[ERROR] /Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/MainSQL.java:[37,47] method map in class org.apache.spark.sql.DataFrame cannot be applied to given types;
[ERROR]   required: scala.Function1<org.apache.spark.sql.Row,R>, scala.reflect.ClassTag<R>
[ERROR]   found: (row)->Na[...]ng(0)
[ERROR]   reason: cannot infer type-variable(s) R
[ERROR]     (actual and formal argument lists differ in length)
[ERROR] /Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/SampleSHit.java:[56,17] method map in class org.apache.spark.sql.DataFrame cannot be applied to given types;
[ERROR]   required: scala.Function1<org.apache.spark.sql.Row,R>, scala.reflect.ClassTag<R>
[ERROR]   found: (row)->row[...]ng(0)
[ERROR]   reason: cannot infer type-variable(s) R
[ERROR]     (actual and formal argument lists differ in length)
[ERROR] -> [Help 1]

Because in the DataFrame the map method is defined as:

[image: inline image 1]

And once this is translated to bytecode, the actual Java signature uses a Function1 and adds a ClassTag parameter. I can try to go around this and use scala.reflect.ClassTag$ like this:

ClassTag$.MODULE$.apply(String.class)

to get the second ClassTag parameter right, but then instantiating a java.util.Function or using the Java 8 lambdas fails to work, and if I try to instantiate a proper Scala Function1... well, this is a world of pain.

This is a regression introduced by the 1.3.x DataFrame, because JavaSchemaRDD used to be a JavaRDDLike but DataFrames are not (and are not callable with JFunctions). I can open a Jira if you want?

Regards,
--
Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
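For reference, here is roughly what the ClassTag workaround described above looks like when spelled out. It is only a sketch of the approach Olivier mentions (and of why it is painful), not a recommended pattern; the RowToName and ClassTagWorkaround names are made up for the example, and the function has to be made Serializable by hand.

import java.io.Serializable;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction1;

// A Scala Function1 written out by hand in Java; it must also be Serializable for Spark.
class RowToName extends AbstractFunction1<Row, String> implements Serializable {
    @Override
    public String apply(Row row) {
        return row.getString(0);
    }
}

public class ClassTagWorkaround {
    public static RDD<String> names(SQLContext sqlContext) {
        // Calling the Scala map method directly means supplying the ClassTag explicitly.
        ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
        return sqlContext.sql("select name from movies").map(new RowToName(), tag);
    }
}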
Fwd: Addition of new Metrics for killed executors.
-- Forwarded message --
From: Archit Thakur archit279tha...@gmail.com
Date: Fri, Apr 17, 2015 at 4:07 PM
Subject: Addition of new Metrics for killed executors.
To: u...@spark.incubator.apache.org, u...@spark.apache.org, d...@spark.incubator.apache.org

Hi,
We are planning to add new metrics in Spark for the executors that got killed during execution. I was just curious why this info is not already present. Is there some reason for not adding it? Any ideas are welcome.

Thanks and Regards,
Archit Thakur.
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
Yes, thanks!

On Fri, Apr 17, 2015 at 4:20 PM, Ted Yu yuzhih...@gmail.com wrote:

The image didn't go through. I think you were referring to:

override def map[R: ClassTag](f: Row => R): RDD[R] = rdd.map(f)

...
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
The image didn't go through. I think you were referring to:

override def map[R: ClassTag](f: Row => R): RDD[R] = rdd.map(f)

Cheers

On Fri, Apr 17, 2015 at 6:07 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: ...
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
I think in 1.3 and above, you'd need to do .sql(...).javaRDD().map(..)

On Fri, Apr 17, 2015 at 9:22 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: ...
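For anyone hitting the same compile error, a minimal sketch of the workaround suggested here: go through javaRDD() first so that the usual org.apache.spark.api.java.function.Function overloads apply. The "movies" temp table comes from the example earlier in the thread; the class and method names are made up. Both the Java 6/7 anonymous-class form and the Java 8 lambda form are shown.

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class JavaRddWorkaround {
    // Java 6/7 style: an explicit org.apache.spark.api.java.function.Function.
    public static List<String> namesJava7(SQLContext sqlContext) {
        JavaRDD<Row> rows = sqlContext.sql("select name from movies").javaRDD();
        return rows.map(new Function<Row, String>() {
            @Override
            public String call(Row row) {
                return row.getString(0);
            }
        }).collect();
    }

    // Java 8 style: the same thing as a lambda.
    public static List<String> namesJava8(SQLContext sqlContext) {
        return sqlContext.sql("select name from movies").javaRDD()
            .map(row -> row.getString(0))
            .collect();
    }
}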
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
Ok, do you want me to open a pull request to fix the dedicated documentation?

On Fri, Apr 17, 2015 at 6:14 PM, Reynold Xin r...@databricks.com wrote: ...
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
Please do! Thanks.

On Fri, Apr 17, 2015 at 2:36 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: ...
BUG: 1.3.0 org.apache.spark.sql.Row Does not exist in Java API
Hi,
The example given in the SQL document https://spark.apache.org/docs/latest/sql-programming-guide.html uses org.apache.spark.sql.Row, which does not exist in the Java API, or at least I was not able to find it.

Build Info - Downloaded from the Spark website

Dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>

Code in documentation:

// Import factory methods provided by DataType.
import org.apache.spark.sql.types.DataType;
// Import StructType and StructField
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructField;
// Import Row.
import org.apache.spark.sql.Row;

// sc is an existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Load a text file and convert each line to a JavaBean.
JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName : schemaString.split(" ")) {
  fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
}
StructType schema = DataType.createStructType(fields);

// Convert records of the RDD (people) to Rows.
JavaRDD<Row> rowRDD = people.map(
  new Function<String, Row>() {
    public Row call(String record) throws Exception {
      String[] fields = record.split(",");
      return Row.create(fields[0], fields[1].trim());
    }
  });

// Apply the schema to the RDD.
DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people");

// SQL can be run over RDDs that have been registered as tables.
DataFrame results = sqlContext.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
List<String> names = results.map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();

Thanks
Nipun
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
No, there isn't a convention. Although if you want to show Java 8, you should also show Java 6/7 syntax, since there are still more Java 7 users than Java 8.

On Fri, Apr 17, 2015 at 3:36 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: ...
Re: BUG: 1.3.0 org.apache.spark.sql.Row Does not exist in Java API
Hi Nipun,
I'm sorry, but I don't understand exactly what your problem is. Regarding org.apache.spark.sql.Row, it does exist in the Spark SQL dependency. Is it a compilation problem? Are you trying to run a main method using the pom you've just described, or are you trying to spark-submit the jar? If you're trying to run a main method, the provided scope is not designed for that and will make your program fail.

Regards,
Olivier.

On Fri, Apr 17, 2015 at 9:52 PM, Nipun Batra bni...@gmail.com wrote: ...
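For what it's worth, a sketch of the same programmatic-schema example written against the 1.3 Java API as I understand it: the DataType factory methods and Row.create from the quoted documentation were replaced by DataTypes and RowFactory. Treat the exact class names as an assumption to check against the 1.3.0 javadoc rather than something confirmed in this thread.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// sc is an existing JavaSparkContext, sqlContext an existing SQLContext.
JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");

// Build the schema programmatically with the 1.3 DataTypes factory.
String schemaString = "name age";
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName : schemaString.split(" ")) {
  fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);

// RowFactory.create replaces the old Row.create shown in the quoted docs.
JavaRDD<Row> rowRDD = people.map(new Function<String, Row>() {
  public Row call(String record) {
    String[] parts = record.split(",");
    return RowFactory.create(parts[0], parts[1].trim());
  }
});

DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);
peopleDataFrame.registerTempTable("people");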
Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?
It's because you did a repartition, which rearranges all the data. Parquet uses all kinds of compression techniques, such as dictionary encoding and run-length encoding, which result in the size difference when the data is ordered differently.

On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei zhangxiongfei0...@163.com wrote: ...
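A minimal Java sketch of the comparison being discussed, using the 1.3 DataFrame API. The output paths and class name are made up for illustration; only the tfs DataFrame and repartition(36) come from the original message.

import org.apache.spark.sql.DataFrame;

public class ParquetSizeComparison {
  // tfs is the DataFrame loaded from the Tachyon Parquet files, as in the original message.
  public static void writeBothWays(DataFrame tfs) {
    // Without repartition(), the existing row order within each partition is preserved,
    // so dictionary and run-length encoding stay roughly as effective as in the source files.
    tfs.saveAsParquetFile("/user/zhangxf/adClick-parquet-no-shuffle");

    // repartition(36) shuffles rows into new partitions; similar values are no longer
    // clustered together, so the encoded files come out larger (~265M vs ~222M here).
    tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-shuffled");
  }
}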
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
another PR I guess :) here's the associated Jira https://issues.apache.org/jira/browse/SPARK-6988

On Fri, Apr 17, 2015 at 11:00 PM, Reynold Xin r...@databricks.com wrote: ...
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
and the PR: https://github.com/apache/spark/pull/5564

Thank you!
Olivier.

On Fri, Apr 17, 2015 at 11:00 PM, Reynold Xin r...@databricks.com wrote: ...
Re: dataframe can not find fields after loading from hive
This is strange. cc'ing the dev list since it might be a bug.

On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:

Never mind. I found the solution:

val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema)

which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problem.

Thanks,
Cesar Flores

On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:

I have a data frame in which I load data from a Hive table, and my issue is that the data frame is missing the columns that I need to query. For example:

val newdataset = dataset.where(dataset("label") === 1)

gives me an error like the following:

ERROR yarn.ApplicationMaster: User class threw exception: resolved attributes label missing from label, user_id, ... (the rest of the fields of my table)
org.apache.spark.sql.AnalysisException: resolved attributes label missing from label, user_id, ... (the rest of the fields of my table)

where we can see that the label field actually exists. I managed to solve this issue by updating my syntax to:

val newdataset = dataset.where($"label" === 1)

which works. However, I cannot use this trick in all my queries. For example, when I try to do a unionAll of two subsets of the same data frame, the error I am getting is that all my fields are missing. Can someone tell me if I need to do some post-processing after loading from Hive in order to avoid this kind of error?

Thanks
--
Cesar Flores
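The thread's snippets are Scala; for Java users hitting the same resolution error, here is roughly what the same round-trip workaround looks like with the 1.3 Java API. This is only a sketch of the workaround Cesar describes, not a documented fix; the class name, hc, and the table name are placeholders.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class HiveDataFrameWorkaround {
  public static DataFrame reload(HiveContext hc) {
    DataFrame hiveLoaded = hc.table("my_hive_table"); // placeholder table name

    // Rebuild the DataFrame from its own RDD and schema, as in the workaround above.
    DataFrame rebuilt = hc.createDataFrame(hiveLoaded.rdd(), hiveLoaded.schema());

    // Column references now resolve as expected.
    return rebuilt.where(rebuilt.col("label").equalTo(1));
  }
}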
Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}
Is there any convention *not* to show java 8 versions in the documentation ? Le ven. 17 avr. 2015 à 21:39, Reynold Xin r...@databricks.com a écrit : Please do! Thanks. On Fri, Apr 17, 2015 at 2:36 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Ok, do you want me to open a pull request to fix the dedicated documentation ? Le ven. 17 avr. 2015 à 18:14, Reynold Xin r...@databricks.com a écrit : I think in 1.3 and above, you'd need to do .sql(...).javaRDD().map(..) On Fri, Apr 17, 2015 at 9:22 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yes thanks ! Le ven. 17 avr. 2015 à 16:20, Ted Yu yuzhih...@gmail.com a écrit : The image didn't go through. I think you were referring to: override def map[R: ClassTag](f: Row = R): RDD[R] = rdd.map(f) Cheers On Fri, Apr 17, 2015 at 6:07 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, I had an issue trying to use Spark SQL from Java (8 or 7), I tried to reproduce it in a small test case close to the actual documentation https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection , so sorry for the long mail, but this is Java : import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.SQLContext; import java.io.Serializable; import java.util.ArrayList; import java.util.Arrays; import java.util.List; class Movie implements Serializable { private int id; private String name; public Movie(int id, String name) { this.id = id; this.name = name; } public int getId() { return id; } public void setId(int id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } public class SparkSQLTest { public static void main(String[] args) { SparkConf conf = new SparkConf(); conf.setAppName(My Application); conf.setMaster(local); JavaSparkContext sc = new JavaSparkContext(conf); ArrayListMovie movieArrayList = new ArrayListMovie(); movieArrayList.add(new Movie(1, Indiana Jones)); JavaRDDMovie movies = sc.parallelize(movieArrayList); SQLContext sqlContext = new SQLContext(sc); DataFrame frame = sqlContext.applySchema(movies, Movie.class); frame.registerTempTable(movies); sqlContext.sql(select name from movies) *.map(row - row.getString(0)) // this is what i would expect to work *.collect(); } } But this does not compile, here's the compilation error : [ERROR] /Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/MainSQL.java:[37,47] method map in class org.apache.spark.sql.DataFrame cannot be applied to given types; [ERROR] *required: scala.Function1org.apache.spark.sql.Row,R,scala.reflect.ClassTagR * [ERROR]* found: (row)-Na[...]ng(0) * [ERROR] *reason: cannot infer type-variable(s) R * [ERROR] *(actual and formal argument lists differ in length) * [ERROR] /Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/SampleSHit.java:[56,17] method map in class org.apache.spark.sql.DataFrame cannot be applied to given types; [ERROR] required: scala.Function1org.apache.spark.sql.Row,R,scala.reflect.ClassTagR [ERROR] found: (row)-row[...]ng(0) [ERROR] reason: cannot infer type-variable(s) R [ERROR] (actual and formal argument lists differ in length) [ERROR] - [Help 1] Because in the DataFrame the *map *method is defined as : [image: Images intégrées 1] And once this is translated to bytecode the actual Java signature uses a Function1 and adds a ClassTag parameter. 
I can try to go around this and use scala.reflect.ClassTag$ like this: ClassTag$.MODULE$.apply(String.class) to get the second ClassTag parameter right, but then instantiating a java.util.Function or using Java 8 lambdas fails to work, and if I try to instantiate a proper Scala Function1... well, this is a world of pain. This is a regression introduced by the 1.3.x DataFrame, because JavaSchemaRDD used to be a JavaRDDLike but DataFrames are not (and are not callable with JFunctions). I can open a JIRA if you want?

Regards,
-- *Olivier Girardot* | Associate o.girar...@lateral-thoughts.com +33 6 24 09 17 94
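A minimal sketch of the workaround Reynold describes above (go through .javaRDD() before mapping), assuming the "movies" temp table registered in the example; the class and method names here (DataFrameJavaMapWorkaround, movieNames) are illustrative only, not from the thread. Because JavaRDD.map takes org.apache.spark.api.java.function.Function, a single-abstract-method interface, a Java 8 lambda compiles and no ClassTag is needed:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.List;

public class DataFrameJavaMapWorkaround {
    // Query through SQLContext, then drop down to the Java RDD API before mapping.
    public static List<String> movieNames(SQLContext sqlContext) {
        JavaRDD<Row> rows = sqlContext.sql("select name from movies").javaRDD();
        JavaRDD<String> names = rows.map(row -> row.getString(0));
        return names.collect();
    }
}

The ClassTag$.MODULE$.apply(String.class) route Olivier mentions can satisfy the second parameter of DataFrame.map, but it still requires building a Scala Function1 by hand, which is why going through javaRDD() is the simpler path.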
Re: [RESULT] [VOTE] Release Apache Spark 1.2.2
Cleaned up ~/.m2 and ~/.zinc. Received the exact same error again. So, -1 from me.

[INFO]
[INFO] Building Spark Project External Flume 1.2.2
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ spark-streaming-flume_2.10 ---
[INFO] Deleting /root/sources/github/spark/external/flume/target
[INFO]
[INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @ spark-streaming-flume_2.10 ---
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ spark-streaming-flume_2.10 ---
[INFO] Source directory: /root/sources/github/spark/external/flume/src/main/scala added.
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-streaming-flume_2.10 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ spark-streaming-flume_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /root/sources/github/spark/external/flume/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-streaming-flume_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 6 Scala sources and 1 Java source to /root/sources/github/spark/external/flume/target/scala-2.10/classes...
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeBatchFetcher.scala:22: object Throwables is not a member of package com.google.common.base
[ERROR] import com.google.common.base.Throwables
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeBatchFetcher.scala:59: not found: value Throwables
[ERROR] Throwables.getRootCause(e) match {
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:26: object util is not a member of package com.google.common
[ERROR] import com.google.common.util.concurrent.ThreadFactoryBuilder
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:69: not found: type ThreadFactoryBuilder
[ERROR] Executors.newCachedThreadPool(new ThreadFactoryBuilder().setDaemon(true).
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:76: not found: type ThreadFactoryBuilder
[ERROR] new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Flume Receiver Thread - %d").build())
[ERROR] ^
[ERROR] 5 errors found
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ................... SUCCESS [01:44 min]
[INFO] Spark Project Networking .................... SUCCESS [ 49.128 s]
[INFO] Spark Project Shuffle Streaming Service ..... SUCCESS [ 8.503 s]
[INFO] Spark Project Core .......................... SUCCESS [05:22 min]
[INFO] Spark Project Bagel ......................... SUCCESS [ 25.647 s]
[INFO] Spark Project GraphX ........................ SUCCESS [01:13 min]
[INFO] Spark Project Streaming ..................... SUCCESS [01:29 min]
[INFO] Spark Project Catalyst ...................... SUCCESS [01:51 min]
[INFO] Spark Project SQL ........................... SUCCESS [01:57 min]
[INFO] Spark Project ML Library .................... SUCCESS [02:25 min]
[INFO] Spark Project Tools ......................... SUCCESS [ 16.665 s]
[INFO] Spark Project Hive .......................... SUCCESS [02:03 min]
[INFO] Spark Project REPL .......................... SUCCESS [ 50.294 s]
[INFO] Spark Project YARN Parent POM ............... SUCCESS [ 5.777 s]
[INFO] Spark Project YARN Stable API ............... SUCCESS [ 53.803 s]
[INFO] Spark Project Assembly ...................... SUCCESS [ 59.515 s]
[INFO] Spark Project External Twitter .............. SUCCESS [ 40.038 s]
[INFO] Spark Project External Flume Sink ........... SUCCESS [ 32.779 s]
[INFO] Spark Project External Flume ................ FAILURE [ 7.936 s]
[INFO] Spark Project External MQTT ................. SKIPPED
[INFO] Spark Project External ZeroMQ ............... SKIPPED
[INFO] Spark Project External Kafka ................ SKIPPED
[INFO] Spark Project Examples ...................... SKIPPED
[INFO] Spark Project YARN Shuffle Service
Announcing Spark 1.3.1 and 1.2.2
Hi All, I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases. We recommend all users on the 1.3 and 1.2 Spark branches upgrade to these releases, which contain several important bug fixes.

Download Spark 1.3.1 or 1.2.2: http://spark.apache.org/downloads.html

Release notes:
1.3.1: http://spark.apache.org/releases/spark-release-1-3-1.html
1.2.2: http://spark.apache.org/releases/spark-release-1-2-2.html

Comprehensive list of fixes:
1.3.1: http://s.apache.org/spark-1.3.1
1.2.2: http://s.apache.org/spark-1.2.2

Thanks to everyone who worked on these releases!

- Patrick
Re: [RESULT] [VOTE] Release Apache Spark 1.2.2
Hi Sean, This is from the build log. I made a master-branch build earlier on this machine. Do you think it needs a cleanup of the .m2 folder, as you suggested one time earlier? Giving it another try while you take a look at this.

[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-streaming-flume_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 6 Scala sources and 1 Java source to /root/sources/github/spark/external/flume/target/scala-2.10/classes...
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeBatchFetcher.scala:22: object Throwables is not a member of package com.google.common.base
[ERROR] import com.google.common.base.Throwables
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeBatchFetcher.scala:59: not found: value Throwables
[ERROR] Throwables.getRootCause(e) match {
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:26: object util is not a member of package com.google.common
[ERROR] import com.google.common.util.concurrent.ThreadFactoryBuilder
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:69: not found: type ThreadFactoryBuilder
[ERROR] Executors.newCachedThreadPool(new ThreadFactoryBuilder().setDaemon(true).
[ERROR] ^
[ERROR] /root/sources/github/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:76: not found: type ThreadFactoryBuilder
[ERROR] new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Flume Receiver Thread - %d").build())
[ERROR] ^
[ERROR] 5 errors found
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ................... SUCCESS [ 15.894 s]
[INFO] Spark Project Networking .................... SUCCESS [ 20.801 s]
[INFO] Spark Project Shuffle Streaming Service ..... SUCCESS [ 18.111 s]
[INFO] Spark Project Core .......................... SUCCESS [08:09 min]
[INFO] Spark Project Bagel ......................... SUCCESS [ 43.592 s]
[INFO] Spark Project GraphX ........................ SUCCESS [01:55 min]
[INFO] Spark Project Streaming ..................... SUCCESS [03:02 min]
[INFO] Spark Project Catalyst ...................... SUCCESS [02:59 min]
[INFO] Spark Project SQL ........................... SUCCESS [03:09 min]
[INFO] Spark Project ML Library .................... SUCCESS [03:24 min]
[INFO] Spark Project Tools ......................... SUCCESS [ 24.816 s]
[INFO] Spark Project Hive .......................... SUCCESS [02:14 min]
[INFO] Spark Project REPL .......................... SUCCESS [01:12 min]
[INFO] Spark Project YARN Parent POM ............... SUCCESS [ 6.080 s]
[INFO] Spark Project YARN Stable API ............... SUCCESS [01:27 min]
[INFO] Spark Project Assembly ...................... SUCCESS [01:22 min]
[INFO] Spark Project External Twitter .............. SUCCESS [ 35.881 s]
[INFO] Spark Project External Flume Sink ........... SUCCESS [ 39.561 s]
[INFO] Spark Project External Flume ................ FAILURE [ 11.212 s]
[INFO] Spark Project External MQTT ................. SKIPPED
[INFO] Spark Project External ZeroMQ ............... SKIPPED
[INFO] Spark Project External Kafka ................ SKIPPED
[INFO] Spark Project Examples ...................... SKIPPED
[INFO] Spark Project YARN Shuffle Service .......... SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 32:36 min
[INFO] Finished at: 2015-04-16T23:02:18-07:00
[INFO] Final Memory: 91M/2043M
[INFO]
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-streaming-flume_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1]
Re: Spark streaming vs. spark usage
I finally got this compiling and working, I think, but since (as Reynold points out) it involves a little API refactoring, I was hoping to get some discussion about it going as soon as possible. I have the changes necessary to give RDD, DStream, and DataFrame some level of common interface, in https://github.com/apache/spark/pull/5565, and would very much appreciate comments.

Thanks, Nathan

On Thu, Dec 19, 2013 at 12:42 AM, Reynold Xin r...@apache.org wrote:

On Wed, Dec 18, 2013 at 12:17 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: Since many of the functions exist in parallel between the two, I guess I would expect something like:

trait BasicRDDFunctions {
  def map...
  def reduce...
  def filter...
  def foreach...
}

class RDD extends BasicRDDFunctions...
class DStream extends BasicRDDFunctions...

I like this idea. We should discuss more about it on the dev list. It would require refactoring some APIs, but does lead to better unification.
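To make the shape of that proposal concrete, here is a hypothetical Java-side sketch (not the contents of PR 5565; all type and method names are invented for illustration) of the kind of shared minimal contract over RDD-like and DStream-like abstractions the thread describes:

import java.io.Serializable;
import java.util.function.BinaryOperator;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// One minimal contract that both a batch (RDD-like) and a streaming (DStream-like)
// abstraction could implement, so user logic can be written once against it.
interface BasicRDDFunctions<T> extends Serializable {
    <R> BasicRDDFunctions<R> map(Function<T, R> f);
    BasicRDDFunctions<T> filter(Predicate<T> p);
    T reduce(BinaryOperator<T> op);
    void foreach(Consumer<T> action);
}

class Transformations {
    // User code written once against the shared contract; it would apply unchanged
    // whether "lines" is backed by a batch collection or by a stream.
    static BasicRDDFunctions<Integer> lineLengths(BasicRDDFunctions<String> lines) {
        return lines.filter(s -> !s.isEmpty()).map(String::length);
    }
}

A real unification would have to resolve the details this sketch glosses over: map on an RDD should return an RDD rather than the bare interface (which pushes toward F-bounded type parameters or an abstract type member), and reduce over a stream naturally yields another stream rather than a single value, which is exactly the kind of API refactoring the thread is raising for discussion.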