Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread René Treffer
Hi,

I'm using sqlContext.jdbc(uri, table, where).map(_ => 1).aggregate(0)(_+_,_+_)
on an interactive shell, where the where argument is an Array[String] of 32 to
48 elements. (The code is tailored to our db, specifically through the where
conditions; I'd have posted it otherwise.)
That should be the DataFrame API, but I'm just trying to load everything
and discard it as soon as possible :-)

(1) Never do a silent drop of the values by default: it kills confidence.
An option sounds reasonable.  Some sort of insight / log would be great.
(How many columns of what type were truncated? why?)
Note that I could declare the field as string via JdbcDialects (thank you
guys for merging that :-) ).
I have quite bad experiences with silent drops / truncates of columns and
thus _like_ the strict way of spark. It causes trouble but noticing later
that your data was corrupted during conversion is even worse.
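
For readers following the JdbcDialects pointer above, a minimal sketch of such
an override, assuming the 1.4 org.apache.spark.sql.jdbc API; the dialect object
and the choice to map DATE/TIMESTAMP columns to strings are illustrative only:

  import java.sql.Types
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
  import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

  // Hypothetical dialect: read problematic MySQL date/timestamp columns back as
  // plain strings so the values survive the fetch and can be parsed later.
  object LenientMySQLDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

    override def getCatalystType(
        sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
      if (sqlType == Types.TIMESTAMP || sqlType == Types.DATE) Some(StringType) else None
  }

  JdbcDialects.registerDialect(LenientMySQLDialect)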

(2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004

(3) One option would be to make it safe to use, the other option would be
to document the behavior (something like: WARNING: this method tries to load
as many partitions as possible; make sure your database can handle the load,
or load them in chunks and use union). SPARK-8008
https://issues.apache.org/jira/browse/SPARK-8008

Regards,
  Rene Treffer


Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread Reynold Xin
Never mind my comment about 3. You were talking about the read side, while
I was thinking about the write side. Your workaround actually is a pretty
good idea. Can you create a JIRA for that as well?

On Monday, June 1, 2015, Reynold Xin r...@databricks.com wrote:

 René,

 Thanks for sharing your experience. Are you using the DataFrame API or SQL?

 (1) Any recommendations on what we do w.r.t. out of range values? Should
 we silently turn them into a null? Maybe based on an option?

 (2) Looks like a good idea to always quote column names. The small tricky
 thing is each database seems to have its own unique quotes. Do you mind
 filing a JIRA for this?

 (3) It is somewhat hard to do because that'd require changing Spark's task
 scheduler. The easiest way may be to coalesce it into a smaller number of
 partitions -- or we can coalesce for the user based on an option. Can you
 file a JIRA for this also?

 Thanks!










 On Mon, Jun 1, 2015 at 1:10 AM, René Treffer rtref...@gmail.com wrote:

 Hi *,

 I used to run into a few problems with the jdbc/mysql integration and
  thought it would be nice to load our whole db, doing nothing but
  .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames.
 SparkSQL has to load all columns and process them so this should reveal
 type errors like
 SPARK-7897 Column with an unsigned bigint should be treated as
 DecimalType in JDBCRDD https://issues.apache.org/jira/browse/SPARK-7897
  SPARK-7697 https://issues.apache.org/jira/browse/SPARK-7697 Column with
  an unsigned int should be treated as long in JDBCRDD

 The test was done on the 1.4 branch (checkout 2-3 days ago, local build,
 running standalone with a 350G heap).

  1. Date/Timestamp 0000-00-00

 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in
  stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
  00:00:00' can not be represented as java.sql.Timestamp
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
  26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
 not be represented as java.sql.Date
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

 This was the most common error when I tried to load tables with
 Date/Timestamp types.
 Can be worked around by subqueries or by specifying those types to be
 string and handling them afterwards.

 2. Keywords as column names fail

 SparkSQL does not enclose column names, e.g.
 SELECT key,value FROM tablename
 fails and should be
 SELECT `key`,`value` FROM tablename

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 157.0 (TID 4322, localhost):
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
 error in your SQL syntax; check the manual that corresponds to your MySQL
 server version for the right syntax to use near 'key,value FROM [XX]'

 I'm not sure how to work around that issue except for manually writing
 sub-queries (with all the performance problems that may cause).

 3. Overloaded DB due to concurrency

 While providing where clauses works well to parallelize the fetch it can
 overload the DB, thus causing trouble (e.g. query/connection timeouts due
 to an overloaded DB).
  It would be nice to specify fetch parallelism independently of the number of
  result partitions (e.g. 100 partitions, but don't fetch more than 5 in
  parallel). This can be emulated by loading just n partitions at a time and
  doing a union afterwards (grouped(5).map(...)).

 Success

  I've successfully loaded 8'573'651'154 rows from 1667 tables (> 93%
  success rate). This is pretty awesome given that e.g. sqoop has failed
  horribly on the same data.
  Note that I did not verify that the retrieved data is valid. I've only
  checked for fetch errors so far. But given that Spark does not silence any
  errors, I'm quite confident that the fetched data will be valid :-)

 Regards,
   Rene Treffer





spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread René Treffer
Hi *,

I used to run into a few problems with the jdbc/mysql integration and
thought it would be nice to load our whole db, doing nothing but
.map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames.
SparkSQL has to load all columns and process them so this should reveal
type errors like
SPARK-7897 Column with an unsigned bigint should be treated as DecimalType
in JDBCRDD https://issues.apache.org/jira/browse/SPARK-7897
SPARK-7697 https://issues.apache.org/jira/browse/SPARK-7697 Column with an
unsigned int should be treated as long in JDBCRDD

The test was done on the 1.4 branch (checkout 2-3 days ago, local build,
running standalone with a 350G heap).

1. Date/Timestamp 0000-00-00

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage
18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
00:00:00' can not be represented as java.sql.Timestamp
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
not be represented as java.sql.Date
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

This was the most common error when I tried to load tables with
Date/Timestamp types.
Can be worked around by subqueries or by specifying those types to be
string and handling them afterwards.
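
A hedged sketch of those two workarounds; the host, database, table and column
names are placeholders, and zeroDateTimeBehavior is a MySQL Connector/J URL
option rather than anything Spark-specific:

  // (a) Let the JDBC driver turn zero dates into NULLs before Spark sees them.
  val url = "jdbc:mysql://dbhost/dbname?zeroDateTimeBehavior=convertToNull"

  // (b) Or push a subquery so MySQL itself nulls out the zero dates.
  val table =
    "(SELECT id, NULLIF(created_at, '0000-00-00 00:00:00') AS created_at " +
    "FROM events) AS events_clean"

  val df = sqlContext.jdbc(url, table)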

2. Keywords as column names fail

SparkSQL does not enclose column names, e.g.
SELECT key,value FROM tablename
fails and should be
SELECT `key`,`value` FROM tablename

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
157.0 (TID 4322, localhost):
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
error in your SQL syntax; check the manual that corresponds to your MySQL
server version for the right syntax to use near 'key,value FROM [XX]'

I'm not sure how to work around that issue except for manually writing
sub-queries (with all the performance problems that may cause).
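
A hedged sketch of that subquery workaround, with placeholder table and column
names:

  // Alias the reserved-word columns inside a subquery so the SELECT that Spark
  // generates only ever references safe names.
  val table = "(SELECT `key` AS k, `value` AS v FROM kv) AS kv_aliased"
  val df = sqlContext.jdbc("jdbc:mysql://dbhost/dbname", table)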

3. Overloaded DB due to concurrency

While providing where clauses works well to parallelize the fetch it can
overload the DB, thus causing trouble (e.g. query/connection timeouts due
to an overloaded DB).
It would be nice to specify fetch parallelism independently of the number of
result partitions (e.g. 100 partitions, but don't fetch more than 5 in
parallel). This can be emulated by loading just n partitions at a time and
doing a union afterwards (grouped(5).map(...)).
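
A minimal sketch of that grouped(5) emulation under the same count-everything
workload described above; the URI, table name and predicate strings are
placeholders:

  val uri = "jdbc:mysql://dbhost/dbname"
  val table = "events"
  val predicates: Array[String] = (0 until 32).map(i => s"id % 32 = $i").toArray
  val maxConcurrent = 5

  // Each pass opens at most maxConcurrent connections against the database;
  // the next chunk only starts once the previous chunk's action has finished.
  val totalRows = predicates.grouped(maxConcurrent).map { preds =>
    sqlContext.jdbc(uri, table, preds).count()
  }.sum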

Success

I've successfully loaded 8'573'651'154 rows from 1667 tables (> 93% success
rate). This is pretty awesome given that e.g. sqoop has failed horribly on
the same data.
Note that I did not verify that the retrieved data is valid. I've only
checked for fetch errors so far. But given that Spark does not silence any
errors, I'm quite confident that the fetched data will be valid :-)

Regards,
  Rene Treffer


Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-01 Thread Steve Loughran
Is this backported to branch 1.3?

On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote:

FYI we merged a patch that improves unit test log debugging. In order for that 
to work, all test suites have been changed to extend SparkFunSuite instead of 
ScalaTest's FunSuite. We also added a rule in the Scala style checker to fail 
Jenkins if FunSuite is used.

The patch that introduced SparkFunSuite: 
https://github.com/apache/spark/pull/6441

Style check patch: https://github.com/apache/spark/pull/6510
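
For context, a minimal sketch of what a suite looks like after this change; the
package and test body are hypothetical, and the suite has to live under
org.apache.spark because SparkFunSuite is package-private:

  package org.apache.spark.mycomponent   // hypothetical sub-package of org.apache.spark

  import org.apache.spark.SparkFunSuite

  class ExampleSuite extends SparkFunSuite {   // instead of extending FunSuite directly
    test("basic sanity") {
      assert(1 + 1 === 2)
    }
  }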




Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread Reynold Xin
René,

Thanks for sharing your experience. Are you using the DataFrame API or SQL?

(1) Any recommendations on what we do w.r.t. out of range values? Should we
silently turn them into a null? Maybe based on an option?

(2) Looks like a good idea to always quote column names. The small tricky
thing is each database seems to have its own unique quotes. Do you mind
filing a JIRA for this?
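
To make the "unique quotes" point concrete, a purely illustrative helper (not a
proposed Spark API) showing how the quote character would have to vary per JDBC
URL:

  def quoteIdentifier(url: String, name: String): String =
    if (url.startsWith("jdbc:mysql")) s"`$name`"           // MySQL: backticks
    else if (url.startsWith("jdbc:sqlserver")) s"[$name]"  // SQL Server: brackets
    else "\"" + name + "\""                                // ANSI SQL: double quotes

  // quoteIdentifier("jdbc:mysql://dbhost/db", "key") yields `key`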

(3) It is somewhat hard to do because that'd require changing Spark's task
scheduler. The easiest way may be to coalesce it into a smaller number of
partitions -- or we can coalesce for the user based on an option. Can you
file a JIRA for this also?

Thanks!










On Mon, Jun 1, 2015 at 1:10 AM, René Treffer rtref...@gmail.com wrote:

 Hi *,

 I used to run into a few problems with the jdbc/mysql integration and
  thought it would be nice to load our whole db, doing nothing but
  .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames.
 SparkSQL has to load all columns and process them so this should reveal
 type errors like
 SPARK-7897 Column with an unsigned bigint should be treated as DecimalType
 in JDBCRDD https://issues.apache.org/jira/browse/SPARK-7897
  SPARK-7697 https://issues.apache.org/jira/browse/SPARK-7697 Column with
  an unsigned int should be treated as long in JDBCRDD

 The test was done on the 1.4 branch (checkout 2-3 days ago, local build,
 running standalone with a 350G heap).

  1. Date/Timestamp 0000-00-00

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage
  18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
  00:00:00' can not be represented as java.sql.Timestamp
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
  26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
 not be represented as java.sql.Date
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

 This was the most common error when I tried to load tables with
 Date/Timestamp types.
 Can be worked around by subqueries or by specifying those types to be
 string and handling them afterwards.

 2. Keywords as column names fail

 SparkSQL does not enclose column names, e.g.
 SELECT key,value FROM tablename
 fails and should be
 SELECT `key`,`value` FROM tablename

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 157.0 (TID 4322, localhost):
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
 error in your SQL syntax; check the manual that corresponds to your MySQL
 server version for the right syntax to use near 'key,value FROM [XX]'

 I'm not sure how to work around that issue except for manually writing
 sub-queries (with all the performance problems that may cause).

 3. Overloaded DB due to concurrency

 While providing where clauses works well to parallelize the fetch it can
 overload the DB, thus causing trouble (e.g. query/connection timeouts due
 to an overloaded DB).
  It would be nice to specify fetch parallelism independently of the number of
  result partitions (e.g. 100 partitions, but don't fetch more than 5 in
  parallel). This can be emulated by loading just n partitions at a time and
  doing a union afterwards (grouped(5).map(...)).

 Success

  I've successfully loaded 8'573'651'154 rows from 1667 tables (> 93%
  success rate). This is pretty awesome given that e.g. sqoop has failed
  horribly on the same data.
  Note that I did not verify that the retrieved data is valid. I've only
  checked for fetch errors so far. But given that Spark does not silence any
  errors, I'm quite confident that the fetched data will be valid :-)

 Regards,
   Rene Treffer



GraphX: New graph operator

2015-06-01 Thread Tarek Auel
Hello,

Someone proposed in a JIRA issue to implement new graph operations. Sean
Owen recommended checking with the mailing list first to see whether this is
of interest or not.

So I would like to know whether it is interesting for GraphX to implement
operators like:
http://en.wikipedia.org/wiki/Graph_operations and/or
http://techieme.in/complex-graph-operations/

If yes, should they be integrated into GraphImpl (like mask, subgraph, etc.)
or provided as an external library? My feeling is that they are similar to
mask, so for consistency they should be part of the graph implementation
itself.

What do you guys think? I would really like to bring GraphX forward and
help implement some of these.

Looking forward to hearing your opinions,
Tarek
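
As a strawman for the discussion, a hedged sketch of how one such operator
(graph union) could be expressed on top of the existing public GraphX API
rather than inside GraphImpl; the merge policy (arbitrarily keep one of the two
attributes for shared vertex ids) is an illustrative choice only:

  import scala.reflect.ClassTag
  import org.apache.spark.graphx.Graph

  def graphUnion[VD: ClassTag, ED: ClassTag](g1: Graph[VD, ED],
                                             g2: Graph[VD, ED]): Graph[VD, ED] = {
    // Merge vertex sets, arbitrarily keeping one attribute for ids present in both graphs.
    val vertices = g1.vertices.union(g2.vertices).reduceByKey((a, _) => a)
    // Merge edge sets and drop exact duplicates.
    val edges = g1.edges.union(g2.edges).distinct()
    Graph(vertices, edges)
  }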


Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-01 Thread Reynold Xin
I don't think so.

On Monday, June 1, 2015, Steve Loughran ste...@hortonworks.com wrote:

  Is this backported to branch 1.3?

   On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote:

  FYI we merged a patch that improves unit test log debugging. In order
 for that to work, all test suites have been changed to extend SparkFunSuite
 instead of ScalaTest's FunSuite. We also added a rule in the Scala style
 checker to fail Jenkins if FunSuite is used.

  The patch that introduced SparkFunSuite:
 https://github.com/apache/spark/pull/6441

  Style check patch: https://github.com/apache/spark/pull/6510





Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Peter Rudenko
Still have problem using HiveContext from sbt. Here’s an example of 
dependencies:


val sparkVersion = "1.4.0-rc3"

lazy val root = Project(id = "spark-hive", base = file("."),
  settings = Project.defaultSettings ++ Seq(
    name := "spark-1.4-hive",
    scalaVersion := "2.10.5",
    scalaBinaryVersion := "2.10",
    resolvers += "Spark RC" at "https://repository.apache.org/content/repositories/orgapachespark-1110/",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "org.apache.spark" %% "spark-hive"  % sparkVersion,
      "org.apache.spark" %% "spark-sql"   % sparkVersion
    )
  ))


Launching sbt console with it and running:

val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val data = sc.parallelize(1 to 1)
import sqlContext.implicits._

scala> data.toDF
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to
metastore using classloader
scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set
spark.sql.hive.metastore.jars
  at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
  at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
  at org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
  at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
  at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
  at org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379)
  at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379)
  at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378)
  at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
  at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
  at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456)
  at org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345)


Thanks,
Peter Rudenko

On 2015-06-01 05:04, Guoqiang Li wrote:


+1 (non-binding)


-- Original --
*From: * Sandy Ryza sandy.r...@cloudera.com
*Date: * Mon, Jun 1, 2015 07:34 AM
*To: * Krishna Sankar ksanka...@gmail.com
*Cc: * Patrick Wendell pwend...@gmail.com; dev@spark.apache.org

*Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

+1 (non-binding)

Launched against a pseudo-distributed YARN cluster running Hadoop 
2.6.0 and ran some jobs.


-Sandy

On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com wrote:


+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -DskipTests
2. Tested pyspark, mlib - running as well as compare results with
1.3.1
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Laso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word
count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with
itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM
Orders INNER JOIN OrderDetails ON Orders.OrderID =
OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State =
'WA'") OK

Cheers
k/

On Fri, May 29, 2015 at 4:40 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache
Spark version 1.4.0!

The tag to be voted on is v1.4.0-rc3 (commit dd109a8):

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730

The release files, including signatures, digests, etc. can be
found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/


Release artifacts are signed with the following key:

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Yin Huai
Hi Peter,

Based on your error message, seems you were not using the RC3. For the
error thrown at HiveContext's line 206, we have changed the message to this
one
https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207
just
before RC3. Basically, we will not print out the class loader name. Can you
check if an older version of the 1.4 branch got used? Have you published an RC3
to your local maven repo? Can you clean your local repo cache and try again?

Thanks,

Yin

On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Still have problem using HiveContext from sbt. Here’s an example of
 dependencies:

  val sparkVersion = 1.4.0-rc3

 lazy val root = Project(id = spark-hive, base = file(.),
settings = Project.defaultSettings ++ Seq(
name := spark-1.4-hive,
scalaVersion := 2.10.5,
scalaBinaryVersion := 2.10,
resolvers += Spark RC at 
 https://repository.apache.org/content/repositories/orgapachespark-1110/; 
 https://repository.apache.org/content/repositories/orgapachespark-1110/,
libraryDependencies ++= Seq(
  org.apache.spark %% spark-core % sparkVersion,
  org.apache.spark %% spark-mllib % sparkVersion,
  org.apache.spark %% spark-hive % sparkVersion,
  org.apache.spark %% spark-sql % sparkVersion
 )

   ))

 Launching sbt console with it and running:

 val conf = new SparkConf().setMaster(local[4]).setAppName(test)
 val sc = new SparkContext(conf)
 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 val data = sc.parallelize(1 to 1)
 import sqlContext.implicits._
 scala data.toDF
 java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
 metastore using classloader 
 scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set 
 spark.sql.hive.metastore.jars
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
 at 
 org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
 at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379)
 at 
 org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379)
 at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474)
 at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456)
 at 
 org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345)

 Thanks,
 Peter Rudenko

 On 2015-06-01 05:04, Guoqiang Li wrote:

   +1 (non-binding)


  -- Original --
  *From: * Sandy Ryza sandy.r...@cloudera.com
 *Date: * Mon, Jun 1, 2015 07:34 AM
 *To: * Krishna Sankar ksanka...@gmail.com
 *Cc: * Patrick Wendell pwend...@gmail.com; dev@spark.apache.org
 *Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

  +1 (non-binding)

  Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0
 and ran some jobs.

  -Sandy

 On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar  ksanka...@gmail.com
 ksanka...@gmail.com wrote:

  +1 (non-binding, of course)

  1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -DskipTests
 2. Tested pyspark, mlib - running as well as compare results with 1.3.1
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql(SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID) 

Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-01 Thread Andrew Or
It will be within the next few days

2015-06-01 9:17 GMT-07:00 Reynold Xin r...@databricks.com:

 I don't think so.


 On Monday, June 1, 2015, Steve Loughran ste...@hortonworks.com wrote:

  Is this backported to branch 1.3?

  On 31 May 2015, at 00:44, Reynold Xin r...@databricks.com wrote:

  FYI we merged a patch that improves unit test log debugging. In order
 for that to work, all test suites have been changed to extend SparkFunSuite
 instead of ScalaTest's FunSuite. We also added a rule in the Scala style
 checker to fail Jenkins if FunSuite is used.

  The patch that introduced SparkFunSuite:
 https://github.com/apache/spark/pull/6441

  Style check patch: https://github.com/apache/spark/pull/6510





Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Andrew Or
+1 (binding)

Tested the standalone cluster mode REST submission gateway - submit /
status / kill
Tested simple applications on YARN client / cluster modes with and without
--jars
Tested python applications on YARN client / cluster modes with and without
--py-files*
Tested dynamic allocation on YARN client / cluster modes

All good.

*Filed SPARK-8017: not a blocker because python in YARN cluster mode is a
new feature


2015-06-01 11:10 GMT-07:00 Yin Huai yh...@databricks.com:

 Hi Peter,

 Based on your error message, seems you were not using the RC3. For the
 error thrown at HiveContext's line 206, we have changed the message to this
 one
 https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207
  just
 before RC3. Basically, we will not print out the class loader name. Can you
 check if an older version of the 1.4 branch got used? Have you published an RC3
 to your local maven repo? Can you clean your local repo cache and try again?

 Thanks,

 Yin

 On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com
 wrote:

  Still have problem using HiveContext from sbt. Here’s an example of
 dependencies:

  val sparkVersion = 1.4.0-rc3

 lazy val root = Project(id = spark-hive, base = file(.),
settings = Project.defaultSettings ++ Seq(
name := spark-1.4-hive,
scalaVersion := 2.10.5,
scalaBinaryVersion := 2.10,
resolvers += Spark RC at 
 https://repository.apache.org/content/repositories/orgapachespark-1110/; 
 https://repository.apache.org/content/repositories/orgapachespark-1110/,
libraryDependencies ++= Seq(
  org.apache.spark %% spark-core % sparkVersion,
  org.apache.spark %% spark-mllib % sparkVersion,
  org.apache.spark %% spark-hive % sparkVersion,
  org.apache.spark %% spark-sql % sparkVersion
 )

   ))

 Launching sbt console with it and running:

 val conf = new SparkConf().setMaster(local[4]).setAppName(test)
 val sc = new SparkContext(conf)
 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 val data = sc.parallelize(1 to 1)
 import sqlContext.implicits._
 scala data.toDF
 java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
 metastore using classloader 
 scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set 
 spark.sql.hive.metastore.jars
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
 at 
 org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
 at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379)
 at 
 org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379)
 at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474)
 at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456)
 at 
 org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345)

 Thanks,
 Peter Rudenko

 On 2015-06-01 05:04, Guoqiang Li wrote:

   +1 (non-binding)


  -- Original --
  *From: * Sandy Ryza;sandy.r...@cloudera.com
 sandy.r...@cloudera.com;
 *Date: * Mon, Jun 1, 2015 07:34 AM
 *To: * Krishna Sankarksanka...@gmail.com ksanka...@gmail.com;
 *Cc: * Patrick Wendellpwend...@gmail.com pwend...@gmail.com;
 dev@spark.apache.org dev@spark.apache.orgdev@spark.apache.org
 dev@spark.apache.org;
 *Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

  +1 (non-binding)

  Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0
 and ran some jobs.

  -Sandy

 On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar  ksanka...@gmail.com
 ksanka...@gmail.com wrote:

  +1 (non-binding, of course)

  1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -DskipTests
 2. Tested pyspark, mlib - running as well as compare results with 1.3.1
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. 

Re: spark 1.4 - test-loading 1786 mysql tables / a few TB

2015-06-01 Thread Reynold Xin
Thanks, René. I actually added a warning to the new JDBC reader/writer
interface for 1.4.0.

Even with that, I think we should support throttling JDBC; otherwise it's
too convenient for our users to DOS their production database servers!


  /**
   * Construct a [[DataFrame]] representing the database table accessible via JDBC URL
   * url named table. Partitions of the table will be retrieved in parallel based on
   * the parameters passed to this function.
   *
   * Don't create too many partitions in parallel on a large cluster; otherwise Spark
   * might crash your external database systems.
   *
   * @param url JDBC database url of the form `jdbc:subprotocol:subname`
   * @param table Name of the table in the external database.
   * @param columnName the name of a column of integral type that will be used for
   *                   partitioning.
   * @param lowerBound the minimum value of `columnName` used to decide partition stride
   * @param upperBound the maximum value of `columnName` used to decide partition stride
   * @param numPartitions the number of partitions.  the range `minValue`-`maxValue` will
   *                      be split evenly into this many partitions
   * @param connectionProperties JDBC database connection arguments, a list of arbitrary
   *                             string tag/value. Normally at least a "user" and
   *                             "password" property should be included.
   *
   * @since 1.4.0
   */
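
For concreteness, a hedged usage sketch against that reader (assuming the 1.4
sqlContext.read.jdbc entry point); the host, table, bounds and credentials are
placeholders:

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "reader")
  props.setProperty("password", "secret")

  val df = sqlContext.read.jdbc(
    "jdbc:mysql://dbhost/dbname",  // url
    "events",                      // table
    "id",                          // integral column used for partitioning
    0L,                            // lower bound of the partition column
    100000000L,                    // upper bound of the partition column
    32,                            // numPartitions -- keep this modest, per the warning above
    props)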


On Mon, Jun 1, 2015 at 1:54 AM, René Treffer rtref...@gmail.com wrote:

 Hi,

  I'm using sqlContext.jdbc(uri, table, where).map(_ => 1).aggregate(0)(_+_,_+_)
  on an interactive shell, where the where argument is an Array[String] of 32 to
  48 elements. (The code is tailored to our db, specifically through the where
  conditions; I'd have posted it otherwise.)
 That should be the DataFrame API, but I'm just trying to load everything
 and discard it as soon as possible :-)

 (1) Never do a silent drop of the values by default: it kills confidence.
 An option sounds reasonable.  Some sort of insight / log would be great.
 (How many columns of what type were truncated? why?)
 Note that I could declare the field as string via JdbcDialects (thank you
 guys for merging that :-) ).
 I have quite bad experiences with silent drops / truncates of columns and
 thus _like_ the strict way of spark. It causes trouble but noticing later
 that your data was corrupted during conversion is even worse.

 (2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004

  (3) One option would be to make it safe to use, the other option would be
  to document the behavior (something like: WARNING: this method tries to load
  as many partitions as possible; make sure your database can handle the load,
  or load them in chunks and use union). SPARK-8008
 https://issues.apache.org/jira/browse/SPARK-8008

 Regards,
   Rene Treffer



Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Peter Rudenko
Thanks Yin, tried on a clean VM - works now. But tests in my app still
fail:


[info]   Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test
connection to the given database. JDBC url =
jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Terminating connection pool (set lazyInit to true if you expect to start
your database after your app). Original Exception: --
[info] java.sql.SQLException: Failed to start database 'metastore_db' with
class loader
org.apache.spark.sql.hive.client.IsolatedClientLoader$anon$1@380628de,
see the next exception for details.
[info]   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
[info]   at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
[info]   at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
[info]   at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
[info]   at org.apache.derby.impl.jdbc.EmbedConnection.init(Unknown Source)
[info]   at org.apache.derby.impl.jdbc.EmbedConnection40.init(Unknown Source)
[info]   at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source)
[info]   at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
[info]   at org.apache.derby.jdbc.Driver20.connect(Unknown Source)
[info]   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:187)
[info]   at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
[info]   at com.jolbox.bonecp.BoneCP.init(BoneCP.java:416)
[info]   at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)


I’ve set

parallelExecution in Test := false,

Thanks,
Peter Rudenko

On 2015-06-01 21:10, Yin Huai wrote:


Hi Peter,

Based on your error message, seems you were not using the RC3. For the 
error thrown at HiveContext's line 206, we have changed the message to 
this one 
https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207 just 
before RC3. Basically, we will not print out the class loader name. 
Can you check if an older version of the 1.4 branch got used? Have you 
published an RC3 to your local maven repo? Can you clean your local 
repo cache and try again?


Thanks,

Yin

On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko petro.rude...@gmail.com wrote:


Still have problem using HiveContext from sbt. Here’s an example
of dependencies:

val sparkVersion = "1.4.0-rc3"

lazy val root = Project(id = "spark-hive", base = file("."),
  settings = Project.defaultSettings ++ Seq(
    name := "spark-1.4-hive",
    scalaVersion := "2.10.5",
    scalaBinaryVersion := "2.10",
    resolvers += "Spark RC" at "https://repository.apache.org/content/repositories/orgapachespark-1110/",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "org.apache.spark" %% "spark-hive"  % sparkVersion,
      "org.apache.spark" %% "spark-sql"   % sparkVersion
    )
  ))

Launching sbt console with it and running:

val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val data = sc.parallelize(1 to 1)
import sqlContext.implicits._

scala> data.toDF
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to
metastore using classloader
scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set
spark.sql.hive.metastore.jars
  at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
  at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
  at org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
  at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
  at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
  at org.apache.spark.sql.hive.HiveContext$anon$1.init(HiveContext.scala:379)
  at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:379)
  at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:378)
  at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
  at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134)
  at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:474)
  at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:456)
  at org.apache.spark.sql.SQLContext$implicits$.intRddToDataFrameHolder(SQLContext.scala:345)

Thanks,
Peter Rudenko

  

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Michael Armbrust
It's no longer valid to start more than one instance of HiveContext in a
single JVM, as one of the goals of this refactoring was to allow connecting
to more than one metastore from a single context.

For tests I suggest you use TestHive as we do in our unit tests.  It has a
reset() method you can use to clean up state between tests/suites.

We could also add an explicit close() method to remove this restriction,
but if that's something you want to investigate we should move this off the
vote thread and onto JIRA.
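
A hedged sketch of that suggestion, assuming the spark-hive tests artifact is on
the test classpath so org.apache.spark.sql.hive.test.TestHive is visible; the
test body is made up:

  import org.apache.spark.sql.hive.test.TestHive
  import org.scalatest.{BeforeAndAfterAll, FunSuite}

  class HiveBackedSuite extends FunSuite with BeforeAndAfterAll {
    import TestHive.implicits._

    test("toDF works against the shared test metastore") {
      val df = TestHive.sparkContext.parallelize(1 to 10).toDF("n")
      assert(df.count() === 10)
    }

    // Clean up metastore/temp-table state so the next suite starts fresh.
    override def afterAll(): Unit = TestHive.reset()
  }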

On Tue, Jun 2, 2015 at 7:19 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Thanks Yin, tried on a clean VM - works now. But tests in my app still
 fails:

 [info]   Cause: javax.jdo.JDOFatalDataStoreException: Unable to open a test 
 connection to the given database. JDBC url = 
 jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
 Terminating connection pool (set lazyInit to true if you expect to start your 
 database after your app). Original Exception: --
 [info] java.sql.SQLException: Failed to start database 'metastore_db' with 
 class loader 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$anon$1@380628de, see 
 the next exception for details.
 [info] at 
 org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
 [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.EmbedConnection.init(Unknown 
 Source)
 [info] at org.apache.derby.impl.jdbc.EmbedConnection40.init(Unknown 
 Source)
 [info] at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown 
 Source)
 [info] at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
 [info] at org.apache.derby.jdbc.Driver20.connect(Unknown Source)
 [info] at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
 [info] at java.sql.DriverManager.getConnection(DriverManager.java:571)
 [info] at java.sql.DriverManager.getConnection(DriverManager.java:187)
 [info] at 
 com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
 [info] at com.jolbox.bonecp.BoneCP.init(BoneCP.java:416)
 [info] at 
 com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)

 I’ve set

 parallelExecution in Test := false,

 Thanks,
 Peter Rudenko

 On 2015-06-01 21:10, Yin Huai wrote:

   Hi Peter,

  Based on your error message, seems you were not using the RC3. For the
 error thrown at HiveContext's line 206, we have changed the message to this
 one
 https://github.com/apache/spark/blob/v1.4.0-rc3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L205-207
  just
 before RC3. Basically, we will not print out the class loader name. Can you
 check if a older version of 1.4 branch got used? Have you published a RC3
 to your local maven repo? Can you clean your local repo cache and try again?

  Thanks,

  Yin

 On Mon, Jun 1, 2015 at 10:45 AM, Peter Rudenko  petro.rude...@gmail.com
 petro.rude...@gmail.com wrote:

  Still have problem using HiveContext from sbt. Here’s an example of
 dependencies:

  val sparkVersion = 1.4.0-rc3

 lazy val root = Project(id = spark-hive, base = file(.),
settings = Project.defaultSettings ++ Seq(
name := spark-1.4-hive,
scalaVersion := 2.10.5,
scalaBinaryVersion := 2.10,
resolvers += Spark RC at 
 https://repository.apache.org/content/repositories/orgapachespark-1110/; 
 https://repository.apache.org/content/repositories/orgapachespark-1110/,
libraryDependencies ++= Seq(
  org.apache.spark %% spark-core % sparkVersion,
  org.apache.spark %% spark-mllib % sparkVersion,
  org.apache.spark %% spark-hive % sparkVersion,
  org.apache.spark %% spark-sql % sparkVersion
 )

   ))

 Launching sbt console with it and running:

 val conf = new SparkConf().setMaster(local[4]).setAppName(test)
 val sc = new SparkContext(conf)
 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 val data = sc.parallelize(1 to 1)
 import sqlContext.implicits._
 scala data.toDF
 java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
 metastore using classloader 
 scala.tools.nsc.interpreter.IMain$TranslatingClassLoader. Please set 
 spark.sql.hive.metastore.jars
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:206)
 at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at 
 org.apache.spark.sql.hive.HiveContext$anon$2.init(HiveContext.scala:367)
 at 
 org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:367)
 at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:366)
 at 
 

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Sean Owen
I get a bunch of failures in VersionSuite with build/test params
-Pyarn -Phive -Phadoop-2.6:

- success sanity check *** FAILED ***
  java.lang.RuntimeException: [download failed:
org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
commons-net#commons-net;3.1!commons-net.jar]
  at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)

... but maybe I missed the memo about how to build for Hive? do I
still need another Hive profile?

Other tests, signatures, etc look good.

On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.4.0!

 The tag to be voted on is v1.4.0-rc3 (commit dd109a8):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-1109/
 [published as version: 1.4.0-rc3]
 https://repository.apache.org/content/repositories/orgapachespark-1110/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Tuesday, June 02, at 00:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What has changed since RC1 ==
 Below is a list of bug fixes that went into this RC:
 http://s.apache.org/vN

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and running on this release candidate,
 then reporting any regressions.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.4 QA period,
 so -1 votes should only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Bobby Chowdary
Hive Context works on RC3 for Mapr after adding
spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819
https://issues.apache.org/jira/browse/SPARK-7819. However, there still
seem to be some other issues with native libraries; I get the warning below:
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
It persists even after adding SPARK_LIBRARYPATH and --driver-library-path.

Built on MacOSX and running CentOS 7 JDK1.6 and JDK 1.8 (tried both)

 make-distribution.sh --tgz --skip-java-test -Phive -Phive-0.13.1 -Pmapr4
-Pnetlib-lgpl -Phive-thriftserver.
  C

On Mon, Jun 1, 2015 at 3:05 PM, Sean Owen so...@cloudera.com wrote:

 I get a bunch of failures in VersionSuite with build/test params
 -Pyarn -Phive -Phadoop-2.6:

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)

 ... but maybe I missed the memo about how to build for Hive? do I
 still need another Hive profile?

 Other tests, signatures, etc look good.

 On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit dd109a8):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
  https://repository.apache.org/content/repositories/orgapachespark-1109/
  [published as version: 1.4.0-rc3]
  https://repository.apache.org/content/repositories/orgapachespark-1110/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Tuesday, June 02, at 00:32 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What has changed since RC1 ==
  Below is a list of bug fixes that went into this RC:
  http://s.apache.org/vN
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[SQL] Write parquet files under partition directories?

2015-06-01 Thread Matt Cheah
Hi there,

I noticed in the latest Spark SQL programming guide
https://spark.apache.org/docs/latest/sql-programming-guide.html , there is
support for optimized reading of partitioned Parquet files that have a
particular directory structure (year=1/month=10/day=3, for example).
However, I see no analogous way to write DataFrames as Parquet files with
similar directory structures based on user-provided partitioning.

Generally, is it possible to write DataFrames as partitioned Parquet files
that downstream partition discovery can take advantage of later? I
considered extending the Parquet output format, but it looks like
ParquetTableOperations.scala has fixed the output format to
AppendingParquetOutputFormat.

Also, I was wondering if it would be valuable to contribute writing Parquet
in partition directories as a PR.

Thanks,

-Matt Cheah






Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Patrick Wendell
Hey Bobby,

Those are generic warnings that the hadoop libraries throw. If you are
using MapRFS they shouldn't matter since you are using the MapR client
and not the default hadoop client.

Do you have any issues with functionality... or was it just seeing the
warnings that was the concern?

Thanks for helping test!

- Patrick

On Mon, Jun 1, 2015 at 5:18 PM, Bobby Chowdary
bobby.chowdar...@gmail.com wrote:
 Hive Context works on RC3 for Mapr after adding
 spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819. However,
 there still seems to be some other issues with native libraries, i get below
 warning
 WARN NativeCodeLoader: Unable to load native-hadoop library for your
 platform... using builtin-java classes where applicable. I tried adding even
 after adding SPARK_LIBRARYPATH and --driver-library-path with no luck.

 Built on MacOSX and running CentOS 7 JDK1.6 and JDK 1.8 (tried both)

  make-distribution.sh --tgz --skip-java-test -Phive -Phive-0.13.1 -Pmapr4
 -Pnetlib-lgpl -Phive-thriftserver.

   C

 On Mon, Jun 1, 2015 at 3:05 PM, Sean Owen so...@cloudera.com wrote:

 I get a bunch of failures in VersionSuite with build/test params
 -Pyarn -Phive -Phadoop-2.6:

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)

 ... but maybe I missed the memo about how to build for Hive? do I
 still need another Hive profile?

 Other tests, signatures, etc look good.

 On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
  1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit dd109a8):
 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
  https://repository.apache.org/content/repositories/orgapachespark-1109/
  [published as version: 1.4.0-rc3]
  https://repository.apache.org/content/repositories/orgapachespark-1110/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Tuesday, June 02, at 00:32 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What has changed since RC1 ==
  Below is a list of bug fixes that went into this RC:
  http://s.apache.org/vN
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [SQL] Write parquet files under partition directories?

2015-06-01 Thread Reynold Xin
There will be in 1.4.

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")
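
And, for completeness, a hedged sketch of the read side under the same layout:
partition discovery turns the year/month/day directories back into columns that
filters can prune on (the path and values are placeholders):

  import sqlContext.implicits._

  val df = sqlContext.read.parquet("/path/to/output")
  // Only the year=2015/month=3/... directories should be scanned for this query.
  val march2015 = df.filter($"year" === 2015 && $"month" === 3)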

On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:

 Hi there,

 I noticed in the latest Spark SQL programming guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html, there
 is support for optimized reading of partitioned Parquet files that have a
 particular directory structure (year=1/month=10/day=3, for example).
 However, I see no analogous way to write DataFrames as Parquet files with
 similar directory structures based on user-provided partitioning.

 Generally, is it possible to write DataFrames as partitioned Parquet files
 that downstream partition discovery can take advantage of later? I
 considered extending the Parquet output format, but it looks like
 ParquetTableOperations.scala has fixed the output format to
 AppendingParquetOutputFormat.

 Also, I was wondering if it would be valuable to contribute writing
 Parquet in partition directories as a PR.

 Thanks,

 -Matt Cheah



Re: [Streaming] Configure executor logging on Mesos

2015-06-01 Thread Gerard Maas
Hi Tim,
(added dev, removed user)

I've created https://issues.apache.org/jira/browse/SPARK-8009 to track this.

-kr, Gerard.

On Sat, May 30, 2015 at 7:10 PM, Tim Chen t...@mesosphere.io wrote:

 So sounds like some generic downloadable uris support can solve this
 problem, that Mesos automatically places in your sandbox and you can refer
 to it.

 If so please file a jira and this is a pretty simple fix on the Spark side.

 Tim

 On Sat, May 30, 2015 at 7:34 AM, andy petrella andy.petre...@gmail.com
 wrote:

 Hello,

 I'm currently exploring DCOS for the spark notebook, and while looking at
 the spark configuration I found something interesting which is actually
 converging to what we've discovered:

 https://github.com/mesosphere/universe/blob/master/repo/packages/S/spark/0/marathon.json

 So the logging is working fine here because the spark package is using
 the spark-class which is able to configure the log4j file. But the
 interesting part comes with the fact that the `uris` parameter is filled in
 with a downloadable path to the log4j file!

  However, it's not possible when creating the spark context ourselves and
  relying on the mesos scheduler backend only, unless the spark.executor.uri
  (or another option) can take more than one downloadable path.

 my.2¢

 andy


 On Fri, May 29, 2015 at 5:09 PM Gerard Maas gerard.m...@gmail.com
 wrote:

 Hi Tim,

  Thanks for the info. We (Andy Petrella and I) have been diving a
  bit deeper into this log config:

 The log line I was referring to is this one (sorry, I provided the
 others just for context)

 *Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties*

  That line comes from Logging.scala [1], where a default config is loaded
  if none is found on the classpath upon the startup of the Spark Mesos
  executor in the Mesos sandbox. At that point in time, none of the
  application-specific resources have been shipped yet, as the executor JVM is
  just starting up. To load a custom configuration file we would need to have
  it in the sandbox before the executor JVM starts and add it to the
  classpath in the startup command. Is that correct?

  For the classpath customization, it looks like it should be possible to
  pass a -Dlog4j.configuration property by using
  'spark.executor.extraClassPath', which will be picked up at [2] and added
  to the command that starts the executor JVM. But the resource must already
  be on the host before we can do that, so we also need some means of
  'shipping' the log4j.configuration file to the allocated executor.
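
  A rough sketch of that configuration from the driver side (the master URI
  and paths are placeholders; both properties are standard Spark settings,
  but this still assumes the log4j.properties file is already present in the
  executor's sandbox):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: paths and master URI are placeholders, and the
    // log4j.properties file must already exist on the executor host/sandbox.
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")
      .setAppName("executor-logging-sketch")
      // Put the directory holding log4j.properties on the executor classpath...
      .set("spark.executor.extraClassPath", "/path/in/sandbox")
      // ...and/or point log4j at it explicitly via a JVM system property.
      .set("spark.executor.extraJavaOptions",
        "-Dlog4j.configuration=file:/path/in/sandbox/log4j.properties")

    val sc = new SparkContext(conf)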

  This all boils down to your statement on the need to ship extra files to
  the sandbox. Bottom line: it's currently not possible to specify a config
  file for your Mesos executor, and our executor stderr log grows by several
  GB/day.

  The only workaround I found so far is to open up the Spark assembly,
  replace the log4j-defaults.properties and pack it up again. That would
  work, although it's kind of rudimentary, as we use the same assembly for
  many jobs. Accessing the log4j API programmatically should probably also
  work (I didn't try that yet).
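
  For reference, a minimal sketch of that programmatic route (log4j 1.x API,
  as bundled with Spark 1.4); whether it runs early enough to quiet executor
  startup logging on Mesos is untested:

    import org.apache.log4j.{Level, Logger}

    // Lower verbosity from application code instead of a properties file.
    // Only affects logging emitted after this runs, so early startup lines
    // (and Mesos' own glog output) are not covered.
    Logger.getRootLogger.setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)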

 Should we open a JIRA for this functionality?

 -kr, Gerard.




 [1]
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Logging.scala#L128
 [2]
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L77

 On Thu, May 28, 2015 at 7:50 PM, Tim Chen t...@mesosphere.io wrote:


 -- Forwarded message --
 From: Tim Chen t...@mesosphere.io
 Date: Thu, May 28, 2015 at 10:49 AM
 Subject: Re: [Streaming] Configure executor logging on Mesos
 To: Gerard Maas gerard.m...@gmail.com


 Hi Gerard,

  The log line you referred to is not Spark logging but Mesos' own
  logging, which uses glog.

  Our own executor logs should only contain very few lines though.

  Most of the log lines you'll see are from Spark, and they can be controlled
  by specifying a log4j.properties file to be downloaded with your Mesos task.
  Alternatively, if you are downloading the Spark executor via
  spark.executor.uri, you can include log4j.properties in that tarball.

  I think we probably need some more configuration options for the Spark
  scheduler to pick up extra files to be downloaded into the sandbox.

 Tim





 On Thu, May 28, 2015 at 6:46 AM, Gerard Maas gerard.m...@gmail.com
 wrote:

 Hi,

  I'm trying to control the verbosity of the logs on the Mesos executors,
  with no luck so far. The default behaviour is an INFO-level dump to stderr
  with unbounded growth that gets too big at some point.

 I noticed that when the executor is instantiated, it locates a default
 log configuration in the spark assembly:

 I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave
 20150528-063307-780930314-5050-8152-S5
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties

 

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-01 Thread Bobby Chowdary
Hi Patrick,
  Thanks for clarifying. No issues with functionality.
+1 (non-binding)

Thanks
Bobby

On Mon, Jun 1, 2015 at 9:41 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Bobby,

 Those are generic warnings that the hadoop libraries throw. If you are
 using MapRFS they shouldn't matter since you are using the MapR client
 and not the default hadoop client.

 Do you have any issues with functionality... or was it just seeing the
 warnings that was the concern?

 Thanks for helping test!

 - Patrick

 On Mon, Jun 1, 2015 at 5:18 PM, Bobby Chowdary
 bobby.chowdar...@gmail.com wrote:
  Hive Context works on RC3 for MapR after adding
  spark.sql.hive.metastore.sharedPrefixes as suggested in SPARK-7819.
  However, there still seem to be some other issues with native libraries; I
  get the warning below:
  WARN NativeCodeLoader: Unable to load native-hadoop library for your
  platform... using builtin-java classes where applicable. I tried even
  after adding SPARK_LIBRARYPATH and --driver-library-path, with no luck.
 
  Built on MacOSX and running CentOS 7 JDK1.6 and JDK 1.8 (tried both)
 
   make-distribution.sh --tgz --skip-java-test -Phive -Phive-0.13.1 -Pmapr4
  -Pnetlib-lgpl -Phive-thriftserver.
 
 
  On Mon, Jun 1, 2015 at 3:05 PM, Sean Owen so...@cloudera.com wrote:
 
  I get a bunch of failures in VersionSuite with build/test params
  -Pyarn -Phive -Phadoop-2.6:
 
  - success sanity check *** FAILED ***
java.lang.RuntimeException: [download failed:
  org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
  commons-net#commons-net;3.1!commons-net.jar]
at
 
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
 
  ... but maybe I missed the memo about how to build for Hive? Do I
  still need another Hive profile?
 
  Other tests, signatures, etc look good.
 
  On Sat, May 30, 2015 at 12:40 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   Please vote on releasing the following candidate as Apache Spark
 version
   1.4.0!
  
   The tag to be voted on is v1.4.0-rc3 (commit dd109a8):
  
  
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730
  
   The release files, including signatures, digests, etc. can be found
 at:
  
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
   [published as version: 1.4.0]
  
 https://repository.apache.org/content/repositories/orgapachespark-1109/
   [published as version: 1.4.0-rc3]
  
 https://repository.apache.org/content/repositories/orgapachespark-1110/
  
   The documentation corresponding to this release can be found at:
  
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/
  
   Please vote on releasing this package as Apache Spark 1.4.0!
  
   The vote is open until Tuesday, June 02, at 00:32 UTC and passes
   if a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.4.0
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   == What has changed since RC1 ==
   Below is a list of bug fixes that went into this RC:
   http://s.apache.org/vN
  
   == How can I help test this release? ==
   If you are a Spark user, you can help us test this release by
   taking a Spark 1.3 workload and running on this release candidate,
   then reporting any regressions.
  
   == What justifies a -1 vote for this release? ==
   This vote is happening towards the end of the 1.4 QA period,
   so -1 votes should only occur for significant regressions from 1.3.1.
   Bugs already present in 1.3.X, minor regressions, or bugs related
   to new features will not block this release.
  
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org