Re: spark 1.4 - test-loading 1786 mysql tables / a few TB
Hi, I'm using sqlContext.jdbc(uri, table, where).map(_ => 1).aggregate(0)(_+_,_+_) on an interactive shell (where "where" is an Array[String] of 32 to 48 elements). (The code is tailored to our db, specifically through the where conditions; I'd have otherwise posted it.) That should be the DataFrame API, but I'm just trying to load everything and discard it as soon as possible :-) (1) Never do a silent drop of the values by default: it kills confidence. An option sounds reasonable. Some sort of insight / log would be great. (How many columns of what type were truncated? Why?) Note that I could declare the field as string via JdbcDialects (thank you guys for merging that :-) ). I have quite bad experiences with silent drops / truncates of columns and thus _like_ the strict way of Spark. It causes trouble, but noticing later that your data was corrupted during conversion is even worse. (2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004 (3) One option would be to make it safe to use, the other option would be to document the behavior (something like WARNING: this method tries to load as many partitions as possible, make sure your database can handle the load, or load them in chunks and use union). SPARK-8008 https://issues.apache.org/jira/browse/SPARK-8008 Regards, Rene Treffer
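For readers who have not used it, a dialect registration along the lines René describes could look roughly like the sketch below (the object name and the decision to map DATE/TIMESTAMP columns to strings are illustrative, not his actual code):

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Read MySQL DATE/TIMESTAMP columns as plain strings, so zero dates like
// '0000-00-00' survive the fetch and can be cleaned up afterwards.
object MySqlStringDateDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.DATE || sqlType == Types.TIMESTAMP) Some(StringType) else None
}

JdbcDialects.registerDialect(MySqlStringDateDialect)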
Re: spark 1.4 - test-loading 1786 mysql tables / a few TB
Never mind my comment about 3. You were talking about the read side, while I was thinking about the write side. Your workaround actually is a pretty good idea. Can you create a JIRA for that as well? On Monday, June 1, 2015, Reynold Xin r...@databricks.com wrote: René, Thanks for sharing your experience. Are you using the DataFrame API or SQL? (1) Any recommendations on what we do w.r.t. out of range values? Should we silently turn them into a null? Maybe based on an option? (2) Looks like a good idea to always quote column names. The small tricky thing is each database seems to have its own unique quotes. Do you mind filing a JIRA for this? (3) It is somewhat hard to do because that'd require changing Spark's task scheduler. The easiest way may be to coalesce it into a smaller number of partitions -- or we can coalesce for the user based on an option. Can you file a JIRA for this also? Thanks! On Mon, Jun 1, 2015 at 1:10 AM, René Treffer rtref...@gmail.com wrote: Hi *, I used to run into a few problems with the jdbc/mysql integration and thought it would be nice to load our whole db, doing nothing but .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames. SparkSQL has to load all columns and process them, so this should reveal type errors like SPARK-7897 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD https://issues.apache.org/jira/browse/SPARK-7897 SPARK-7697 https://issues.apache.org/jira/browse/SPARK-7697 Column with an unsigned int should be treated as long in JDBCRDD The test was done on the 1.4 branch (checkout 2-3 days ago, local build, running standalone with a 350G heap). 1. Date/Timestamp 0000-00-00 org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00 00:00:00' can not be represented as java.sql.Timestamp at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can not be represented as java.sql.Date at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998) This was the most common error when I tried to load tables with Date/Timestamp types. It can be worked around by subqueries or by specifying those types to be string and handling them afterwards. 2. Keywords as column names fail SparkSQL does not enclose column names, e.g. SELECT key,value FROM tablename fails and should be SELECT `key`,`value` FROM tablename org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]' I'm not sure how to work around that issue except for manually writing sub-queries (with all the performance problems that may cause). 3. Overloaded DB due to concurrency While providing where clauses works well to parallelize the fetch, it can overload the DB, thus causing trouble (e.g. query/connection timeouts due to an overloaded DB). It would be nice to specify fetch parallelism independent from result partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel). This can be emulated by loading just n partitions at a time and doing a union afterwards (grouped(5).map(...)). Success I've successfully loaded 8'573'651'154 rows from 1667 tables (93% success rate). This is pretty awesome given that e.g. sqoop has failed horribly on the same data. Note that I didn't verify that the retrieved data is valid; I've only checked for fetch errors so far. But given that Spark does not silence any errors, I'm quite confident that fetched data will be valid :-) Regards, Rene Treffer
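A rough sketch of that chunked-fetch workaround, using the uri/table/where-clause values René describes (everything else here is made up for illustration):

// Fetch at most 5 of the where-clause partitions at a time, materialize each chunk,
// then union the cached pieces into a single DataFrame.
val chunks = whereClauses.grouped(5).map { chunk =>
  val df = sqlContext.jdbc(uri, table, chunk).cache()
  df.count()  // force the fetch now, so only this chunk hits the database
  df
}.toList
val all = chunks.reduce(_ unionAll _)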
spark 1.4 - test-loading 1786 mysql tables / a few TB
Hi *, I used to run into a few problems with the jdbc/mysql integration and thought it would be nice to load our whole db, doing nothing but .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames. SparkSQL has to load all columns and process them, so this should reveal type errors like SPARK-7897 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD https://issues.apache.org/jira/browse/SPARK-7897 SPARK-7697 https://issues.apache.org/jira/browse/SPARK-7697 Column with an unsigned int should be treated as long in JDBCRDD The test was done on the 1.4 branch (checkout 2-3 days ago, local build, running standalone with a 350G heap). 1. Date/Timestamp 0000-00-00 org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00 00:00:00' can not be represented as java.sql.Timestamp at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can not be represented as java.sql.Date at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998) This was the most common error when I tried to load tables with Date/Timestamp types. It can be worked around by subqueries or by specifying those types to be string and handling them afterwards. 2. Keywords as column names fail SparkSQL does not enclose column names, e.g. SELECT key,value FROM tablename fails and should be SELECT `key`,`value` FROM tablename org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]' I'm not sure how to work around that issue except for manually writing sub-queries (with all the performance problems that may cause). 3. Overloaded DB due to concurrency While providing where clauses works well to parallelize the fetch, it can overload the DB, thus causing trouble (e.g. query/connection timeouts due to an overloaded DB). It would be nice to specify fetch parallelism independent from result partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel). This can be emulated by loading just n partitions at a time and doing a union afterwards (grouped(5).map(...)). Success I've successfully loaded 8'573'651'154 rows from 1667 tables (93% success rate). This is pretty awesome given that e.g. sqoop has failed horribly on the same data. Note that I didn't verify that the retrieved data is valid; I've only checked for fetch errors so far. But given that Spark does not silence any errors, I'm quite confident that fetched data will be valid :-) Regards, Rene Treffer
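For context, the per-partition where clauses are plain SQL predicates; a purely illustrative construction (the column name, ranges and JDBC URL are invented) might be:

// Split a numeric primary-key range into equal-width slices, one predicate per partition.
val step = 1000000L
val whereClauses: Array[String] = (0L until 40L).map { i =>
  s"id >= ${i * step} AND id < ${(i + 1) * step}"
}.toArray
val df = sqlContext.jdbc("jdbc:mysql://dbhost/mydb", "mytable", whereClauses)
df.map(_ => 1).aggregate(0)(_ + _, _ + _)  // force a full fetch, discard the rows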
Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on
Is this backported to branch 1.3? On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote: FYI we merged a patch that improves unit test log debugging. In order for that to work, all test suites have been changed to extend SparkFunSuite instead of ScalaTest's FunSuite. We also added a rule in the Scala style checker to fail Jenkins if FunSuite is used. The patch that introduced SparkFunSuite: https://github.com/apache/spark/pull/6441 Style check patch: https://github.com/apache/spark/pull/6510
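For context, after that change a suite inside the Spark tree looks roughly like the following (the suite name and test body are invented; SparkFunSuite is private[spark], so this only applies to Spark's own modules):

package org.apache.spark.mypackage  // must live under org.apache.spark

import org.apache.spark.SparkFunSuite

class ExampleSuite extends SparkFunSuite {
  test("a simple assertion") {
    assert(1 + 1 === 2)
  }
}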
Re: spark 1.4 - test-loading 1786 mysql tables / a few TB
René, Thanks for sharing your experience. Are you using the DataFrame API or SQL? (1) Any recommendations on what we do w.r.t. out of range values? Should we silently turn them into a null? Maybe based on an option? (2) Looks like a good idea to always quote column names. The small tricky thing is each database seems to have its own unique quotes. Do you mind filing a JIRA for this? (3) It is somewhat hard to do because that'd require changing Spark's task scheduler. The easiest way may be to coalesce it into a smaller number of partitions -- or we can coalesce for the user based on an option. Can you file a JIRA for this also? Thanks! On Mon, Jun 1, 2015 at 1:10 AM, René Treffer rtref...@gmail.com wrote: Hi *, I used to run into a few problems with the jdbc/mysql integration and thought it would be nice to load our whole db, doing nothing but .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames. SparkSQL has to load all columns and process them, so this should reveal type errors like SPARK-7897 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD https://issues.apache.org/jira/browse/SPARK-7897 SPARK-7697 https://issues.apache.org/jira/browse/SPARK-7697 Column with an unsigned int should be treated as long in JDBCRDD The test was done on the 1.4 branch (checkout 2-3 days ago, local build, running standalone with a 350G heap). 1. Date/Timestamp 0000-00-00 org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00 00:00:00' can not be represented as java.sql.Timestamp at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can not be represented as java.sql.Date at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998) This was the most common error when I tried to load tables with Date/Timestamp types. It can be worked around by subqueries or by specifying those types to be string and handling them afterwards. 2. Keywords as column names fail SparkSQL does not enclose column names, e.g. SELECT key,value FROM tablename fails and should be SELECT `key`,`value` FROM tablename org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]' I'm not sure how to work around that issue except for manually writing sub-queries (with all the performance problems that may cause). 3. Overloaded DB due to concurrency While providing where clauses works well to parallelize the fetch, it can overload the DB, thus causing trouble (e.g. query/connection timeouts due to an overloaded DB). It would be nice to specify fetch parallelism independent from result partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel). This can be emulated by loading just n partitions at a time and doing a union afterwards (grouped(5).map(...)). Success I've successfully loaded 8'573'651'154 rows from 1667 tables (93% success rate). This is pretty awesome given that e.g. sqoop has failed horribly on the same data. Note that I didn't verify that the retrieved data is valid; I've only checked for fetch errors so far. But given that Spark does not silence any errors, I'm quite confident that fetched data will be valid :-) Regards, Rene Treffer
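One workaround for the reserved-word problem, sketched purely as an illustration (the aliases are invented), is to pass Spark a derived table that does the quoting itself:

// Spark selects its columns from whatever table expression it is given and appends the
// WHERE clause, so the backtick-quoted names never reach the generated column list.
// Any column referenced in the where clauses must be exposed by the subquery.
val quotedTable = "(SELECT `key` AS k, `value` AS v FROM tablename) AS t"
val df = sqlContext.jdbc(uri, quotedTable, whereClauses)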
GraphX: New graph operator
Hello, Someone proposed in a JIRA issue to implement new graph operations. Sean Owen recommended checking with the mailing list first whether this is interesting or not. So I would like to know whether it would be interesting for GraphX to implement operators like those described at http://en.wikipedia.org/wiki/Graph_operations and/or http://techieme.in/complex-graph-operations/ If yes, should they be integrated into GraphImpl (like mask, subgraph etc.) or provided as an external library? My feeling is that they are similar to mask, and for consistency they should be part of the graph implementation itself. What do you guys think? I really would like to bring GraphX forward and help to implement some of these. Looking forward to hearing your opinions, Tarek
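To make the proposal concrete, a graph union written as an external helper might look roughly like this minimal sketch (duplicate vertex ids keep an arbitrary attribute, which a real operator would have to define properly):

import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Union of two graphs over the same vertex-id space: concatenate the vertex and edge
// RDDs and let Graph() resolve duplicate vertex ids.
def graphUnion[VD: ClassTag, ED: ClassTag](
    g1: Graph[VD, ED], g2: Graph[VD, ED]): Graph[VD, ED] =
  Graph(g1.vertices.union(g2.vertices), g1.edges.union(g2.edges))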
Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on
I don't think so. On Monday, June 1, 2015, Steve Loughran ste...@hortonworks.com wrote: Is this backported to branch 1.3? On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote: FYI we merged a patch that improves unit test log debugging. In order for that to work, all test suites have been changed to extend SparkFunSuite instead of ScalaTest's FunSuite. We also added a rule in the Scala style checker to fail Jenkins if FunSuite is used. The patch that introduced SparkFunSuite: https://github.com/apache/spark/pull/6441 Style check patch: https://github.com/apache/spark/pull/6510
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Still having problems using HiveContext from sbt. Here's an example of the dependencies:

val sparkVersion = "1.4.0-rc3"
lazy val root = Project(id = "spark-hive", base = file("."),
  settings = Project.defaultSettings ++ Seq(
    name := "spark-1.4-hive",
    scalaVersion := "2.10.5",
    scalaBinaryVersion := "2.10",
    resolvers += "Spark RC" at "https://repository.apache.org/content/repositories/orgapachespark-1110/",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "org.apache.spark" %% "spark-hive" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion
    )
  ))

Launching sbt console with it and running:

val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val data = sc.parallelize(1 to 1)
import sqlContext.implicits._

scala> data.toDF
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to metastore using classloader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set spark.sql.hive.metastore.jars
  at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
  at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
  at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:367)
  at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
  at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
  at org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:379)
  at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379)
  at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378)
  at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
  at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:134)
  at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456)
  at org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345)

Thanks, Peter Rudenko On 2015-06-01 05:04, Guoqiang Li wrote: +1 (non-binding) -- Original -- *From: * Sandy Ryza sandy.r...@cloudera.com *Date: * Mon, Jun 1, 2015 07:34 AM *To: * Krishna Sankar ksanka...@gmail.com *Cc: * Patrick Wendell pwend...@gmail.com; dev@spark.apache.org *Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC3) +1 (non-binding) Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and ran some jobs. -Sandy On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests 2. Tested pyspark, MLlib - running as well as comparing results with 1.3.1 2.1. statistics (min,max,mean,Pearson,Spearman) OK 2.2. Linear/Ridge/Lasso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK Center And Scale OK 2.5. RDD operations OK State of the Union Texts - MapReduce, Filter, sortByKey (word count) 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK Model evaluation/optimization (rank, numIter, lambda) with itertools OK 3. Scala - MLlib 3.1. statistics (min,max,mean,Pearson,Spearman) OK 3.2. LinearRegressionWithSGD OK 3.3. Decision Tree OK 3.4. KMeans OK 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK 3.6. saveAsParquetFile OK 3.7. Read and verify the 4.3 save (above) - sqlContext.parquetFile, registerTempTable, sql OK 3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK 4.0. Spark SQL from Python OK 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK Cheers k/ On Fri, May 29, 2015 at 4:40 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit dd109a8): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/ Release artifacts are signed with the following key:
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Hi Peter, Based on your error message, seems you were not using the RC3. For the error thrown at HiveContext's line 206, we have changed the message to this one https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just before RC3. Basically, we will not print out the class loader name. Can you check if a older version of 1.4 branch got used? Have you published a RC3 to your local maven repo? Can you clean your local repo cache and try again? Thanks, Yin On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com wrote: Still have problem using HiveContext from sbt. Here’s an example of dependencies: val sparkVersion = 1.4.0-rc3 lazy val root = Project(id = spark-hive, base = file(.), settings = Project.defaultSettings ++ Seq( name := spark-1.4-hive, scalaVersion := 2.10.5, scalaBinaryVersion := 2.10, resolvers += Spark RC at https://repository.apache.org/content/repositories/orgapachespark-1110/; https://repository.apache.org/content/repositories/orgapachespark-1110/, libraryDependencies ++= Seq( org.apache.spark %% spark-core % sparkVersion, org.apache.spark %% spark-mllib % sparkVersion, org.apache.spark %% spark-hive % sparkVersion, org.apache.spark %% spark-sql % sparkVersion ) )) Launching sbt console with it and running: val conf = new SparkConf().setMaster(local[4]).setAppName(test) val sc = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val data = sc.parallelize(1 to 1) import sqlContext.implicits._ scala data.toDF java.lang.IllegalArgumentException: Unable to locate hive jars to connect to metastore using classloader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set spark.sql.hive.metastore.jars at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206) at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175) at org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366) at org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379) at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379) at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456) at org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345) Thanks, Peter Rudenko On 2015-06-01 05:04, Guoqiang Li wrote: +1 (non-binding) -- Original -- *From: * Sandy Ryza;sandy.r...@cloudera.com sandy.r...@cloudera.com ; *Date: * Mon, Jun 1, 2015 07:34 AM *To: * Krishna Sankarksanka...@gmail.com ksanka...@gmail.com; *Cc: * Patrick Wendellpwend...@gmail.com pwend...@gmail.com; dev@spark.apache.org dev@spark.apache.orgdev@spark.apache.org dev@spark.apache.org; *Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC3) +1 (non-binding) Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and ran some jobs. -Sandy On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. 
Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests 2. Tested pyspark, mlib - running as well as compare results with 1.3.1 2.1. statistics (min,max,mean,Pearson,Spearman) OK 2.2. Linear/Ridge/Laso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK Center And Scale OK 2.5. RDD operations OK State of the Union Texts - MapReduce, Filter,sortByKey (word count) 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK Model evaluation/optimization (rank, numIter, lambda) with itertools OK 3. Scala - MLlib 3.1. statistics (min,max,mean,Pearson,Spearman) OK 3.2. LinearRegressionWithSGD OK 3.3. Decision Tree OK 3.4. KMeans OK 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK 3.6. saveAsParquetFile OK 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile, registerTempTable, sql OK 3.8. result = sqlContext.sql(SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID)
Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on
It will be within the next few days 2015-06-01 9:17 GMT-07:00 Reynold Xin r...@databricks.com: I don't think so. On Monday, June 1, 2015, Steve Loughran ste...@hortonworks.com wrote: Is this backported to branch 1.3? On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote: FYI we merged a patch that improves unit test log debugging. In order for that to work, all test suites have been changed to extend SparkFunSuite instead of ScalaTest's FunSuite. We also added a rule in the Scala style checker to fail Jenkins if FunSuite is used. The patch that introduced SparkFunSuite: https://github.com/apache/spark/pull/6441 Style check patch: https://github.com/apache/spark/pull/6510
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
+1 (binding) Tested the standalone cluster mode REST submission gateway - submit / status / kill Tested simple applications on YARN client / cluster modes with and without --jars Tested python applications on YARN client / cluster modes with and without --py-files* Tested dynamic allocation on YARN client / cluster modes All good. *Filed SPARK-8017: not a blocker because python in YARN cluster mode is a new feature 2015-06-01 11:10 GMT-07:00 Yin Huai yh...@databricks.com: Hi Peter, Based on your error message, seems you were not using the RC3. For the error thrown at HiveContext's line 206, we have changed the message to this one https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just before RC3. Basically, we will not print out the class loader name. Can you check if a older version of 1.4 branch got used? Have you published a RC3 to your local maven repo? Can you clean your local repo cache and try again? Thanks, Yin On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com wrote: Still have problem using HiveContext from sbt. Here’s an example of dependencies: val sparkVersion = 1.4.0-rc3 lazy val root = Project(id = spark-hive, base = file(.), settings = Project.defaultSettings ++ Seq( name := spark-1.4-hive, scalaVersion := 2.10.5, scalaBinaryVersion := 2.10, resolvers += Spark RC at https://repository.apache.org/content/repositories/orgapachespark-1110/; https://repository.apache.org/content/repositories/orgapachespark-1110/, libraryDependencies ++= Seq( org.apache.spark %% spark-core % sparkVersion, org.apache.spark %% spark-mllib % sparkVersion, org.apache.spark %% spark-hive % sparkVersion, org.apache.spark %% spark-sql % sparkVersion ) )) Launching sbt console with it and running: val conf = new SparkConf().setMaster(local[4]).setAppName(test) val sc = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val data = sc.parallelize(1 to 1) import sqlContext.implicits._ scala data.toDF java.lang.IllegalArgumentException: Unable to locate hive jars to connect to metastore using classloader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. 
Please set spark.sql.hive.metastore.jars at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206) at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175) at org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366) at org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379) at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379) at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456) at org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345) Thanks, Peter Rudenko On 2015-06-01 05:04, Guoqiang Li wrote: +1 (non-binding) -- Original -- *From: * Sandy Ryza;sandy.r...@cloudera.com sandy.r...@cloudera.com; *Date: * Mon, Jun 1, 2015 07:34 AM *To: * Krishna Sankarksanka...@gmail.com ksanka...@gmail.com; *Cc: * Patrick Wendellpwend...@gmail.com pwend...@gmail.com; dev@spark.apache.org dev@spark.apache.orgdev@spark.apache.org dev@spark.apache.org; *Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC3) +1 (non-binding) Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and ran some jobs. -Sandy On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests 2. Tested pyspark, mlib - running as well as compare results with 1.3.1 2.1. statistics (min,max,mean,Pearson,Spearman) OK 2.2. Linear/Ridge/Laso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK Center And Scale OK 2.5. RDD operations OK State of the Union Texts - MapReduce, Filter,sortByKey (word count) 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK Model evaluation/optimization (rank, numIter, lambda) with itertools OK 3.
Re: spark 1.4 - test-loading 1786 mysql tables / a few TB
Thanks, René. I actually added a warning to the new JDBC reader/writer interface for 1.4.0. Even with that, I think we should support throttling JDBC; otherwise it's too convenient for our users to DOS their production database servers!

/**
 * Construct a [[DataFrame]] representing the database table accessible via JDBC URL
 * url named table. Partitions of the table will be retrieved in parallel based on the parameters
 * passed to this function.
 *
 * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
 * your external database systems.
 *
 * @param url JDBC database url of the form `jdbc:subprotocol:subname`
 * @param table Name of the table in the external database.
 * @param columnName the name of a column of integral type that will be used for partitioning.
 * @param lowerBound the minimum value of `columnName` used to decide partition stride
 * @param upperBound the maximum value of `columnName` used to decide partition stride
 * @param numPartitions the number of partitions. the range `minValue`-`maxValue` will be split
 *                      evenly into this many partitions
 * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
 *                             tag/value. Normally at least a user and password property
 *                             should be included.
 *
 * @since 1.4.0
 */

On Mon, Jun 1, 2015 at 1:54 AM, René Treffer rtref...@gmail.com wrote: Hi, I'm using sqlContext.jdbc(uri, table, where).map(_ => 1).aggregate(0)(_+_,_+_) on an interactive shell (where "where" is an Array[String] of 32 to 48 elements). (The code is tailored to our db, specifically through the where conditions; I'd have otherwise posted it.) That should be the DataFrame API, but I'm just trying to load everything and discard it as soon as possible :-) (1) Never do a silent drop of the values by default: it kills confidence. An option sounds reasonable. Some sort of insight / log would be great. (How many columns of what type were truncated? Why?) Note that I could declare the field as string via JdbcDialects (thank you guys for merging that :-) ). I have quite bad experiences with silent drops / truncates of columns and thus _like_ the strict way of Spark. It causes trouble, but noticing later that your data was corrupted during conversion is even worse. (2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004 (3) One option would be to make it safe to use, the other option would be to document the behavior (something like WARNING: this method tries to load as many partitions as possible, make sure your database can handle the load, or load them in chunks and use union). SPARK-8008 https://issues.apache.org/jira/browse/SPARK-8008 Regards, Rene Treffer
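As a usage illustration of that method (the URL, table, bounds and credentials below are placeholders, not anything from the thread):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")
// 100 output partitions over the id range; all of them may hit the database concurrently,
// which is exactly the throttling concern raised above.
val orders = sqlContext.read.jdbc(
  "jdbc:mysql://dbhost/mydb", "orders",
  columnName = "id", lowerBound = 0L, upperBound = 100000000L,
  numPartitions = 100, connectionProperties = props)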
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Thanks Yin, tried on a clean VM - works now. But tests in my app still fails: |[info] Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: -- [info] java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$anon$1@380628de, see the next exception for details. [info] at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) [info] at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source) [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection.init(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection40.init(Unknown Source) [info] at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source) [info] at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source) [info] at org.apache.derby.jdbc.Driver20.connect(Unknown Source) [info] at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:187) [info] at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361) [info] at com.jolbox.bonecp.BoneCP.init(BoneCP.java:416) [info] at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120) | I’ve set parallelExecution in Test := false, Thanks, Peter Rudenko On 2015-06-01 21:10, Yin Huai wrote: Hi Peter, Based on your error message, seems you were not using the RC3. For the error thrown at HiveContext's line 206, we have changed the message to this one https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just before RC3. Basically, we will not print out the class loader name. Can you check if a older version of 1.4 branch got used? Have you published a RC3 to your local maven repo? Can you clean your local repo cache and try again? Thanks, Yin On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com mailto:petro.rude...@gmail.com wrote: Still have problem using HiveContext from sbt. 
Here’s an example of dependencies: |val sparkVersion = 1.4.0-rc3 lazy val root = Project(id = spark-hive, base = file(.), settings = Project.defaultSettings ++ Seq( name := spark-1.4-hive, scalaVersion := 2.10.5, scalaBinaryVersion := 2.10, resolvers += Spark RC at https://repository.apache.org/content/repositories/orgapachespark-1110/; https://repository.apache.org/content/repositories/orgapachespark-1110/, libraryDependencies ++= Seq( org.apache.spark %% spark-core % sparkVersion, org.apache.spark %% spark-mllib % sparkVersion, org.apache.spark %% spark-hive % sparkVersion, org.apache.spark %% spark-sql % sparkVersion ) )) | Launching sbt console with it and running: |val conf = new SparkConf().setMaster(local[4]).setAppName(test) val sc = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val data = sc.parallelize(1 to 1) import sqlContext.implicits._ scala data.toDF java.lang.IllegalArgumentException: Unable to locate hive jars to connect to metastore using classloader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set spark.sql.hive.metastore.jars at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206) at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175) at org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366) at org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379) at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379) at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456) at org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345) | Thanks, Peter Rudenko
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Its no longer valid to start more than one instance of HiveContext in a single JVM, as one of the goals of this refactoring was to allow connection to more than one metastore from a single context. For tests I suggest you use TestHive as we do in our unit tests. It has a reset() method you can use to cleanup state between tests/suites. We could also add an explicit close() method to remove this restriction, but if thats something you want to investigate we should move this off the vote thread and onto JIRA. On Tue, Jun 2, 2015 at 7:19 AM, Peter Rudenko petro.rude...@gmail.com wrote: Thanks Yin, tried on a clean VM - works now. But tests in my app still fails: [info] Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: -- [info] java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$anon$1@380628de, see the next exception for details. [info] at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) [info] at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source) [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection.init(Unknown Source) [info] at org.apache.derby.impl.jdbc.EmbedConnection40.init(Unknown Source) [info] at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source) [info] at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source) [info] at org.apache.derby.jdbc.Driver20.connect(Unknown Source) [info] at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:187) [info] at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361) [info] at com.jolbox.bonecp.BoneCP.init(BoneCP.java:416) [info] at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120) I’ve set parallelExecution in Test := false, Thanks, Peter Rudenko On 2015-06-01 21:10, Yin Huai wrote: Hi Peter, Based on your error message, seems you were not using the RC3. For the error thrown at HiveContext's line 206, we have changed the message to this one https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just before RC3. Basically, we will not print out the class loader name. Can you check if a older version of 1.4 branch got used? Have you published a RC3 to your local maven repo? Can you clean your local repo cache and try again? Thanks, Yin On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com petro.rude...@gmail.com wrote: Still have problem using HiveContext from sbt. 
Here’s an example of dependencies: val sparkVersion = 1.4.0-rc3 lazy val root = Project(id = spark-hive, base = file(.), settings = Project.defaultSettings ++ Seq( name := spark-1.4-hive, scalaVersion := 2.10.5, scalaBinaryVersion := 2.10, resolvers += Spark RC at https://repository.apache.org/content/repositories/orgapachespark-1110/; https://repository.apache.org/content/repositories/orgapachespark-1110/, libraryDependencies ++= Seq( org.apache.spark %% spark-core % sparkVersion, org.apache.spark %% spark-mllib % sparkVersion, org.apache.spark %% spark-hive % sparkVersion, org.apache.spark %% spark-sql % sparkVersion ) )) Launching sbt console with it and running: val conf = new SparkConf().setMaster(local[4]).setAppName(test) val sc = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val data = sc.parallelize(1 to 1) import sqlContext.implicits._ scala data.toDF java.lang.IllegalArgumentException: Unable to locate hive jars to connect to metastore using classloader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set spark.sql.hive.metastore.jars at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206) at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175) at org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366) at
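A minimal sketch of what Michael's suggestion could look like in a user test suite (assuming the spark-hive test utilities are on the test classpath; the suite, column name and data are invented):

import org.scalatest.{BeforeAndAfterAll, FunSuite}
import org.apache.spark.sql.hive.test.TestHive

class MyHiveSuite extends FunSuite with BeforeAndAfterAll {
  import TestHive.implicits._

  override def afterAll(): Unit = TestHive.reset()  // clear metastore state between suites

  test("toDF against the shared test HiveContext") {
    val df = TestHive.sparkContext.parallelize(1 to 10).toDF("n")
    assert(df.count() === 10)
  }
}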
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
I get a bunch of failures in VersionSuite with build/test params -Pyarn -Phive -Phadoop-2.6: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) ... but maybe I missed the memo about how to build for Hive? do I still need another Hive profile? Other tests, signatures, etc look good. On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit dd109a8): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-1109/ [published as version: 1.4.0-rc3] https://repository.apache.org/content/repositories/orgapachespark-1110/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Tuesday, June 02, at 00:32 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC1 == Below is a list of bug fixes that went into this RC: http://s.apache.org/vN == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Hive Context works on RC3 for Mapr after adding spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819 https://issues.apache.org/jira/browse/SPARK-7819. However, there still seems to be some other issues with native libraries, i get below warning WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. I tried adding even after adding SPARK_LIBRARYPATH and --driver-library-path with no luck. Built on MacOSX and running CentOS 7 JDK1.6 and JDK 1.8 (tried both) make-distribution.sh --tgz --skip-java-test -Phive -Phive-0.13.1 -Pmapr4 -Pnetlib-lgpl -Phive-thriftserver. C On Mon, Jun 1, 2015 at 3:05 PM, Sean Owen so...@cloudera.com wrote: I get a bunch of failures in VersionSuite with build/test params -Pyarn -Phive -Phadoop-2.6: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) ... but maybe I missed the memo about how to build for Hive? do I still need another Hive profile? Other tests, signatures, etc look good. On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit dd109a8): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-1109/ [published as version: 1.4.0-rc3] https://repository.apache.org/content/repositories/orgapachespark-1110/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Tuesday, June 02, at 00:32 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC1 == Below is a list of bug fixes that went into this RC: http://s.apache.org/vN == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[SQL] Write parquet files under partition directories?
Hi there, I noticed in the latest Spark SQL programming guide https://spark.apache.org/docs/latest/sql-programming-guide.html, there is support for optimized reading of partitioned Parquet files that have a particular directory structure (year=1/month=10/day=3, for example). However, I see no analogous way to write DataFrames as Parquet files with similar directory structures based on user-provided partitioning. Generally, is it possible to write DataFrames as partitioned Parquet files that downstream partition discovery can take advantage of later? I considered extending the Parquet output format, but it looks like ParquetTableOperations.scala has fixed the output format to AppendingParquetOutputFormat. Also, I was wondering if it would be valuable to contribute writing Parquet in partition directories as a PR. Thanks, -Matt Cheah
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Hey Bobby, Those are generic warnings that the hadoop libraries throw. If you are using MapRFS they shouldn't matter since you are using the MapR client and not the default hadoop client. Do you have any issues with functionality... or was it just seeing the warnings that was the concern? Thanks for helping test! - Patrick On Mon, Jun 1, 2015 at 5:18 PM, Bobby Chowdary bobby.chowdar...@gmail.com wrote: Hive Context works on RC3 for Mapr after adding spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819. However, there still seems to be some other issues with native libraries, i get below warning WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. I tried adding even after adding SPARK_LIBRARYPATH and --driver-library-path with no luck. Built on MacOSX and running CentOS 7 JDK1.6 and JDK 1.8 (tried both) make-distribution.sh --tgz --skip-java-test -Phive -Phive-0.13.1 -Pmapr4 -Pnetlib-lgpl -Phive-thriftserver. C On Mon, Jun 1, 2015 at 3:05 PM, Sean Owen so...@cloudera.com wrote: I get a bunch of failures in VersionSuite with build/test params -Pyarn -Phive -Phadoop-2.6: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) ... but maybe I missed the memo about how to build for Hive? do I still need another Hive profile? Other tests, signatures, etc look good. On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit dd109a8): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-1109/ [published as version: 1.4.0-rc3] https://repository.apache.org/content/repositories/orgapachespark-1110/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Tuesday, June 02, at 00:32 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC1 == Below is a list of bug fixes that went into this RC: http://s.apache.org/vN == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release. 
Re: [SQL] Write parquet files under partition directories?
There will be in 1.4: df.write.partitionBy("year", "month", "day").parquet("/path/to/output") On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote: Hi there, I noticed in the latest Spark SQL programming guide https://spark.apache.org/docs/latest/sql-programming-guide.html, there is support for optimized reading of partitioned Parquet files that have a particular directory structure (year=1/month=10/day=3, for example). However, I see no analogous way to write DataFrames as Parquet files with similar directory structures based on user-provided partitioning. Generally, is it possible to write DataFrames as partitioned Parquet files that downstream partition discovery can take advantage of later? I considered extending the Parquet output format, but it looks like ParquetTableOperations.scala has fixed the output format to AppendingParquetOutputFormat. Also, I was wondering if it would be valuable to contribute writing Parquet in partition directories as a PR. Thanks, -Matt Cheah
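For completeness, the read side then rediscovers those directories as partition columns (the path and filter values are illustrative):

import sqlContext.implicits._

val partitioned = sqlContext.read.parquet("/path/to/output")
// year/month/day come back as columns, and filters on them prune whole directories.
partitioned.filter($"year" === 2015 && $"month" === 6).show()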
Re: [Streaming] Configure executor logging on Mesos
Hi Tim, (added dev, removed user) I've created https://issues.apache.org/jira/browse/SPARK-8009 to track this. -kr, Gerard. On Sat, May 30, 2015 at 7:10 PM, Tim Chen t...@mesosphere.io wrote: So it sounds like some generic downloadable-URIs support can solve this problem, where Mesos automatically places the files in your sandbox and you can refer to them. If so please file a JIRA; this is a pretty simple fix on the Spark side. Tim On Sat, May 30, 2015 at 7:34 AM, andy petrella andy.petre...@gmail.com wrote: Hello, I'm currently exploring DCOS for the spark notebook, and while looking at the spark configuration I found something interesting which is actually converging to what we've discovered: https://github.com/mesosphere/universe/blob/master/repo/packages/S/spark/0/marathon.json So the logging is working fine here because the spark package is using spark-class, which is able to configure the log4j file. But the interesting part comes with the fact that the `uris` parameter is filled in with a downloadable path to the log4j file! However, it's not possible when creating the spark context ourselves and relying on the Mesos scheduler backend only, unless spark.executor.uri (or another option) can take more than one downloadable path. my.2¢ andy On Fri, May 29, 2015 at 5:09 PM Gerard Maas gerard.m...@gmail.com wrote: Hi Tim, Thanks for the info. We (Andy Petrella and myself) have been diving a bit deeper into this log config: The log line I was referring to is this one (sorry, I provided the others just for context) *Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties* That line comes from Logging.scala [1], where a default config is loaded if none is found in the classpath upon the startup of the Spark Mesos executor in the Mesos sandbox. At that point in time, none of the application-specific resources have been shipped yet, as the executor JVM is just starting up. To load a custom configuration file we should have it already in the sandbox before the executor JVM starts and add it to the classpath in the startup command. Is that correct? For the classpath customization, it looks like it should be possible to pass a -Dlog4j.configuration property by using 'spark.executor.extraClassPath', which will be picked up at [2] and added to the command that starts the executor JVM, but the resource must already be on the host before we can do that. Therefore we also need some means of 'shipping' the log4j.configuration file to the allocated executor. This all boils down to your statement on the need of shipping extra files to the sandbox. Bottom line: it's currently not possible to specify a log config file for your Mesos executor (our executor logs grow by several GB/day). The only workaround I found so far is to open up the Spark assembly, replace the log4j-defaults.properties and pack it up again. That would work, although it's kind of rudimentary, as we use the same assembly for many jobs. Probably, accessing the log4j API programmatically should also work (I didn't try that yet). Should we open a JIRA for this functionality? -kr, Gerard.
[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Logging.scala#L128 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L77 On Thu, May 28, 2015 at 7:50 PM, Tim Chen t...@mesosphere.io wrote: -- Forwarded message -- From: Tim Chen t...@mesosphere.io Date: Thu, May 28, 2015 at 10:49 AM Subject: Re: [Streaming] Configure executor logging on Mesos To: Gerard Maas gerard.m...@gmail.com Hi Gerard, The log line you referred to is not Spark logging but Mesos' own logging, which uses glog. Our own executor logs should only contain very few lines though. Most of the log lines you'll see are from Spark, and they can be controlled by specifying a log4j.properties to be downloaded with your Mesos task. Alternatively, if you are downloading the Spark executor via spark.executor.uri, you can include log4j.properties in that tarball. I think we probably need some more configuration for the Spark scheduler to pick up extra files to be downloaded into the sandbox. Tim On Thu, May 28, 2015 at 6:46 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, I'm trying to control the verbosity of the logs on the Mesos executors with no luck so far. The default behaviour is an INFO-level dump to stderr with unbounded growth that gets too big at some point. I noticed that when the executor is instantiated, it locates a default log configuration in the Spark assembly: I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave 20150528-063307-780930314-5050-8152-S5 Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
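One shape the discussed workaround could take, with the big caveat that it assumes the log4j.properties file has already been shipped into the executor sandbox (which is exactly the missing piece this thread is about), might be:

// Untested sketch: point the executor JVM at a log4j config sitting in its sandbox;
// the distribution URI is a placeholder, and the tarball could bundle the file.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-Dlog4j.configuration=file:./log4j.properties")
  .set("spark.executor.uri", "hdfs:///dist/spark-1.4.0-bin.tgz")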
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
Hi Patrick, Thanks for clarifying. No issues with functionality. +1 (non-binding) Thanks Bobby On Mon, Jun 1, 2015 at 9:41 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Bobby, Those are generic warnings that the hadoop libraries throw. If you are using MapRFS they shouldn't matter since you are using the MapR client and not the default hadoop client. Do you have any issues with functionality... or was it just seeing the warnings that was the concern? Thanks for helping test! - Patrick On Mon, Jun 1, 2015 at 5:18 PM, Bobby Chowdary bobby.chowdar...@gmail.com wrote: Hive Context works on RC3 for Mapr after adding spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819. However, there still seems to be some other issues with native libraries, i get below warning WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. I tried adding even after adding SPARK_LIBRARYPATH and --driver-library-path with no luck. Built on MacOSX and running CentOS 7 JDK1.6 and JDK 1.8 (tried both) make-distribution.sh --tgz --skip-java-test -Phive -Phive-0.13.1 -Pmapr4 -Pnetlib-lgpl -Phive-thriftserver. C On Mon, Jun 1, 2015 at 3:05 PM, Sean Owen so...@cloudera.com wrote: I get a bunch of failures in VersionSuite with build/test params -Pyarn -Phive -Phadoop-2.6: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) ... but maybe I missed the memo about how to build for Hive? do I still need another Hive profile? Other tests, signatures, etc look good. On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit dd109a8): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-1109/ [published as version: 1.4.0-rc3] https://repository.apache.org/content/repositories/orgapachespark-1110/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Tuesday, June 02, at 00:32 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC1 == Below is a list of bug fixes that went into this RC: http://s.apache.org/vN == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? 
== This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release.