[jira] [Created] (SPARK-14482) Change default compression codec for Parquet from gzip to snappy

2016-04-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14482:
---

 Summary: Change default compression codec for Parquet from gzip to 
snappy
 Key: SPARK-14482
 URL: https://issues.apache.org/jira/browse/SPARK-14482
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Based on our tests, gzip decompression is very slow (< 100 MB/s), making queries 
decompression-bound. Snappy can decompress at ~500 MB/s on a single core.
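For reference, a minimal sketch of how the codec can be pinned explicitly through the existing {{spark.sql.parquet.compression.codec}} setting, so only users relying on the default are affected by this change (the output path is illustrative):

{code}
// Pin the Parquet codec explicitly instead of relying on the default
// (gzip today, snappy after this change). The path is illustrative.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.range(1000).write.parquet("/tmp/events_snappy")
{code}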






[jira] [Assigned] (SPARK-14482) Change default compression codec for Parquet from gzip to snappy

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14482:


Assignee: Reynold Xin  (was: Apache Spark)

> Change default compression codec for Parquet from gzip to snappy
> 
>
> Key: SPARK-14482
> URL: https://issues.apache.org/jira/browse/SPARK-14482
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Based on our tests, gzip decompression is very slow (< 100MB/s), making 
> queries decompression bound. Snappy can decompress at ~ 500MB/s on a single 
> core.





[jira] [Assigned] (SPARK-14482) Change default compression codec for Parquet from gzip to snappy

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14482:


Assignee: Apache Spark  (was: Reynold Xin)

> Change default compression codec for Parquet from gzip to snappy
> 
>
> Key: SPARK-14482
> URL: https://issues.apache.org/jira/browse/SPARK-14482
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Based on our tests, gzip decompression is very slow (< 100MB/s), making 
> queries decompression bound. Snappy can decompress at ~ 500MB/s on a single 
> core.





[jira] [Commented] (SPARK-14482) Change default compression codec for Parquet from gzip to snappy

2016-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231767#comment-15231767
 ] 

Apache Spark commented on SPARK-14482:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12256

> Change default compression codec for Parquet from gzip to snappy
> 
>
> Key: SPARK-14482
> URL: https://issues.apache.org/jira/browse/SPARK-14482
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Based on our tests, gzip decompression is very slow (< 100MB/s), making 
> queries decompression bound. Snappy can decompress at ~ 500MB/s on a single 
> core.





[jira] [Created] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-14483:
--

 Summary: Display user name for each job and query
 Key: SPARK-14483
 URL: https://issues.apache.org/jira/browse/SPARK-14483
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Kousuke Saruta


SPARK-14245 introduced displaying the name of the user who runs jobs on AllJobsPage.

User names are pulled from the system property "user.name", but that property identifies 
who runs the application, and individual jobs in an application can be run by different users.

One example is the case of using ThriftServer: the ThriftServer process is run by one user, 
but queries are not always submitted by that user.
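For context, a minimal sketch of the behaviour described above, assuming the name is taken from the driver JVM's "user.name" system property and therefore cannot vary per job or query:

{code}
// Sketch: "user.name" is a per-JVM system property, so it identifies who
// launched the application, not who submitted an individual job or query.
val appUser = System.getProperty("user.name")
println(s"All jobs in this application are attributed to: $appUser")
{code}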





[jira] [Created] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly

2016-04-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14484:
--

 Summary: Fail to create parquet filter if the column name does not 
match exactly
 Key: SPARK-14484
 URL: https://issues.apache.org/jira/browse/SPARK-14484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


An exception about "no key found" will be thrown from ParquetFilters.createFilter().
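The report does not include a reproduction. As one hypothetical illustration of a column name that "does not match exactly", the sketch below stores a column in upper case and filters on it in lower case; whether casing is the actual trigger here is an assumption, and the path is illustrative:

{code}
// Hypothetical sketch (assumption: the mismatch is a casing difference
// between the filtered column name and the name stored in the Parquet file).
val df = sqlContext.range(10).selectExpr("id AS ID")
df.write.parquet("/tmp/case_test")
sqlContext.read.parquet("/tmp/case_test").filter("id = 1").show()
{code}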





[jira] [Updated] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-14483:
---
Description: 
SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage.   
  
User names are pulled from the system property "user.name" but the property 
means who runs an application and each jobs in a n application can run by 
different users.

 
One example is the case of using ThriftServer.  
  
ThriftServer is run by a user but queries are not always submitted by the user. 
   

  was:
SPARK-14245 introduced displaying user name who runs jobs on ALLJobsPage.   
  
User names are pulled from the system property "user.name" but the property 
means who runs an application and each jobs in a n application can run by 
different users.

 
One example is the case of using ThriftServer.  
  
ThriftServer is run by a user but queries are not always submitted by the user. 
   


> Display user name for each job and query
> 
>
> Key: SPARK-14483
> URL: https://issues.apache.org/jira/browse/SPARK-14483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage. 
>   
>   
> User names are pulled from the system property "user.name" but the property 
> means who runs an application and each jobs in a n application can run by 
> different users.  
>   
>  
> One example is the case of using ThriftServer.
>   
>   
> ThriftServer is run by a user but queries are not always submitted by the 
> user. 
>





[jira] [Updated] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-14483:
---
Description: 
SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage.   
  
User names are pulled from the system property "user.name" but the property 
means who runs an application and each jobs in a n application can be run by 
different users.

 
One example is the case of using ThriftServer.  
  
ThriftServer is run by a user but queries are not always submitted by the user. 
   

  was:
SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage.   
  
User names are pulled from the system property "user.name" but the property 
means who runs an application and each jobs in a n application can run by 
different users.

 
One example is the case of using ThriftServer.  
  
ThriftServer is run by a user but queries are not always submitted by the user. 
   


> Display user name for each job and query
> 
>
> Key: SPARK-14483
> URL: https://issues.apache.org/jira/browse/SPARK-14483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage. 
>   
>   
> User names are pulled from the system property "user.name" but the property 
> means who runs an application and each jobs in a n application can be run by 
> different users.  
>   
>  
> One example is the case of using ThriftServer.
>   
>   
> ThriftServer is run by a user but queries are not always submitted by the 
> user. 
>





[jira] [Resolved] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14103.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>Assignee: Hyukjin Kwon
>  Labels: csv, csvparser, dataframe, pyspark
> Fix For: 2.0.0
>
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are 
> written to stderr.
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
> aborting job
> ^M[Stage 1:>

[jira] [Updated] (SPARK-14189) JSON data source infers a field type as StringType when some are inferred as DecimalType not capable of IntegralType.

2016-04-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14189:

Assignee: Hyukjin Kwon  (was: Dongjoon Hyun)

> JSON data source infers a field type as StringType when some are inferred as 
> DecimalType not capable of IntegralType.
> -
>
> Key: SPARK-14189
> URL: https://issues.apache.org/jira/browse/SPARK-14189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> When the types inferred for the same field while finding a compatible {{DataType}} 
> are {{IntegralType}} and {{DecimalType}}, but the {{DecimalType}} cannot hold the 
> given {{IntegralType}}, the JSON data source simply parses the field as 
> {{StringType}}.
> This can be observed when {{floatAsBigDecimal}} is enabled.
> {code}
> def mixedIntegerAndDoubleRecords: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 3, "b": 1.1}""" ::
> """{"a": 3.1, "b": 1}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(mixedIntegerAndDoubleRecords)
>   .printSchema()
> {code}
> produces below:
> {code}
> root
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
> {code}
>  When {{floatAsBigDecimal}} is disabled.
> {code}
> def mixedIntegerAndDoubleRecords: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 3, "b": 1.1}""" ::
> """{"a": 3.1, "b": 1}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "false")
>   .json(mixedIntegerAndDoubleRecords)
>   .printSchema()
> {code}
> produces below correctly:
> {code}
> root
>  |-- a: double (nullable = true)
>  |-- b: double (nullable = true)
> {code}





[jira] [Resolved] (SPARK-14189) JSON data source infers a field type as StringType when some are inferred as DecimalType not capable of IntegralType.

2016-04-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14189.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> JSON data source infers a field type as StringType when some are inferred as 
> DecimalType not capable of IntegralType.
> -
>
> Key: SPARK-14189
> URL: https://issues.apache.org/jira/browse/SPARK-14189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> When the types inferred for the same field while finding a compatible {{DataType}} 
> are {{IntegralType}} and {{DecimalType}}, but the {{DecimalType}} cannot hold the 
> given {{IntegralType}}, the JSON data source simply parses the field as 
> {{StringType}}.
> This can be observed when {{floatAsBigDecimal}} is enabled.
> {code}
> def mixedIntegerAndDoubleRecords: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 3, "b": 1.1}""" ::
> """{"a": 3.1, "b": 1}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(mixedIntegerAndDoubleRecords)
>   .printSchema()
> {code}
> produces below:
> {code}
> root
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
> {code}
>  When {{floatAsBigDecimal}} is disabled.
> {code}
> def mixedIntegerAndDoubleRecords: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 3, "b": 1.1}""" ::
> """{"a": 3.1, "b": 1}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "false")
>   .json(mixedIntegerAndDoubleRecords)
>   .printSchema()
> {code}
> produces below correctly:
> {code}
> root
>  |-- a: double (nullable = true)
>  |-- b: double (nullable = true)
> {code}





[jira] [Created] (SPARK-14485) Task finished cause fetch failure when its executor has already removed by driver

2016-04-08 Thread iward (JIRA)
iward created SPARK-14485:
-

 Summary: Task finished cause fetch failure when its executor has 
already removed by driver 
 Key: SPARK-14485
 URL: https://issues.apache.org/jira/browse/SPARK-14485
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2, 1.3.1
Reporter: iward


Currently, when an executor is removed by the driver because of a heartbeat timeout, the 
driver re-queues the tasks that were on that executor and sends a kill command to the 
cluster to kill the executor.
But in some situations, the running task on that executor finishes and returns its result 
to the driver before the executor is killed by the driver's kill command. In that case, the 
driver accepts the task-finished event and ignores the speculative and re-queued copies of 
the task. However, since the executor has already been removed by the driver, the result of 
the finished task cannot be saved on the driver, because its *BlockManagerId* has also been 
removed from *BlockManagerMaster*. As a result, the output of this stage is incomplete, 
which later causes a fetch failure.

For example, the following is the task log:
{noformat}
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
256015 ms
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
tasks for 322 from TaskSet 107.0
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 229.0 
in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
ExecutorLostFailure (executor 322 lost)
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
322 (epoch 11)
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
Trying to remove executor 322 from BlockManagerMaster.
2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 322 
successfully in removeExecutor
{noformat}

{noformat}
2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
229.0 in stage 107.0 (TID 10384) in 272315 ms on 
BJHC-HERA-16168.hadoop.jd.local (579/700)
2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
task-finished event for 229.1 in stage 107.0 because task 229 has already 
completed successfully
{noformat}

{noformat}
2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at mapPartitions 
at Exchange.scala:137)
2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task set 
107.1 with 3 tasks
2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, PROCESS_LOCAL, 
3745 bytes)
2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
3745 bytes)
2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, PROCESS_LOCAL, 
3745 bytes)
{noformat}

The driver then detects that the stage's output is incomplete and submits the missing tasks, 
but by this time the next stage has already started running, because the previous stage was 
marked as finished (all of its tasks completed) even though its output is incomplete.

{noformat}
2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 39.0 
in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): FetchFailed(null, 
shuffleId=11, mapId=-1, reduceId=39, message=
2015-12-31 04:40:13 INFO org.apache.spark.shuffle.MetadataFetchFailedException: 
Missing an output location for shuffle 11
2015-12-31 04:40:13 INFO at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
2015-12-31 04:40:13 INFO at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
2015-12-31 04:40:13 INFO at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
2015-12-31 04:40:13 INFO at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
2015-12-31 04:40:13 INFO at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
2015-12-31 04:40:13 INFO at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
2015-12-31 04:40:13 INFO at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
2015-12-31 04:40:13 INFO at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
2015-12-31 04:40:13 INFO at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
2015-1

[jira] [Assigned] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14483:


Assignee: Apache Spark

> Display user name for each job and query
> 
>
> Key: SPARK-14483
> URL: https://issues.apache.org/jira/browse/SPARK-14483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage. 
>   
>   
> User names are pulled from the system property "user.name" but the property 
> means who runs an application and each jobs in a n application can be run by 
> different users.  
>   
>  
> One example is the case of using ThriftServer.
>   
>   
> ThriftServer is run by a user but queries are not always submitted by the 
> user. 
>





[jira] [Commented] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231787#comment-15231787
 ] 

Apache Spark commented on SPARK-14483:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/12257

> Display user name for each job and query
> 
>
> Key: SPARK-14483
> URL: https://issues.apache.org/jira/browse/SPARK-14483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage. 
>   
>   
> User names are pulled from the system property "user.name" but the property 
> means who runs an application and each jobs in a n application can be run by 
> different users.  
>   
>  
> One example is the case of using ThriftServer.
>   
>   
> ThriftServer is run by a user but queries are not always submitted by the 
> user. 
>





[jira] [Assigned] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14483:


Assignee: (was: Apache Spark)

> Display user name for each job and query
> 
>
> Key: SPARK-14483
> URL: https://issues.apache.org/jira/browse/SPARK-14483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage. 
>   
>   
> User names are pulled from the system property "user.name" but the property 
> means who runs an application and each jobs in a n application can be run by 
> different users.  
>   
>  
> One example is the case of using ThriftServer.
>   
>   
> ThriftServer is run by a user but queries are not always submitted by the 
> user. 
>





[jira] [Updated] (SPARK-14483) Display user name for each job and query

2016-04-08 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-14483:
---
Description: 
SPARK-14245 introduced a feature displaying user name who runs jobs on 
AllJobsPage.
 
User names are pulled from the system property "user.name" but the property 
means who runs an application and each jobs in a n application can be run by 
different users.

 
One example is the case of using ThriftServer.  
  
ThriftServer is run by a user but queries are not always submitted by the user. 
   

  was:
SPARK-14245 introduced displaying user name who runs jobs on AllJobsPage.   
  
User names are pulled from the system property "user.name" but the property 
means who runs an application and each jobs in a n application can be run by 
different users.

 
One example is the case of using ThriftServer.  
  
ThriftServer is run by a user but queries are not always submitted by the user. 
   


> Display user name for each job and query
> 
>
> Key: SPARK-14483
> URL: https://issues.apache.org/jira/browse/SPARK-14483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-14245 introduced a feature displaying user name who runs jobs on 
> AllJobsPage.  
>
> User names are pulled from the system property "user.name" but the property 
> means who runs an application and each jobs in a n application can be run by 
> different users.  
>   
>  
> One example is the case of using ThriftServer.
>   
>   
> ThriftServer is run by a user but queries are not always submitted by the 
> user. 
>





[jira] [Assigned] (SPARK-14485) Task finished cause fetch failure when its executor has already removed by driver

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14485:


Assignee: (was: Apache Spark)

> Task finished cause fetch failure when its executor has already removed by 
> driver 
> --
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>
> Now, when executor is removed by driver with heartbeats timeout, driver will 
> re-queue the task on this executor and send a kill command to cluster to kill 
> this executor.
> But, in a situation, the running task of this executor is finished and return 
> result to driver before this executor killed by kill command sent by driver. 
> At this situation, driver will accept the task finished event and ignore  
> speculative task and re-queued task. But, as we know, this executor has 
> removed by driver, the result of this finished task can not save in driver 
> because the *BlockManagerId* has also removed from *BlockManagerMaster* by 
> driver. So, the result data of this stage is not complete, and then, it will 
> cause fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> Driver will check the stage's result is not complete, and submit missing 
> task, but this time, the next stage has run because previous stage has finish 
> for its task is all finished although its result is not complete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 2015-12-31 04:40:13 INFO at 
> sca

[jira] [Assigned] (SPARK-14485) Task finished cause fetch failure when its executor has already removed by driver

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14485:


Assignee: Apache Spark

> Task finished cause fetch failure when its executor has already removed by 
> driver 
> --
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: Apache Spark
>
> Now, when executor is removed by driver with heartbeats timeout, driver will 
> re-queue the task on this executor and send a kill command to cluster to kill 
> this executor.
> But, in a situation, the running task of this executor is finished and return 
> result to driver before this executor killed by kill command sent by driver. 
> At this situation, driver will accept the task finished event and ignore  
> speculative task and re-queued task. But, as we know, this executor has 
> removed by driver, the result of this finished task can not save in driver 
> because the *BlockManagerId* has also removed from *BlockManagerMaster* by 
> driver. So, the result data of this stage is not complete, and then, it will 
> cause fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> Driver will check the stage's result is not complete, and submit missing 
> task, but this time, the next stage has run because previous stage has finish 
> for its task is all finished although its result is not complete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 2015-12-3

[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already removed by driver

2016-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231822#comment-15231822
 ] 

Apache Spark commented on SPARK-14485:
--

User 'zhonghaihua' has created a pull request for this issue:
https://github.com/apache/spark/pull/12258

> Task finished cause fetch failure when its executor has already removed by 
> driver 
> --
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>
> Now, when executor is removed by driver with heartbeats timeout, driver will 
> re-queue the task on this executor and send a kill command to cluster to kill 
> this executor.
> But, in a situation, the running task of this executor is finished and return 
> result to driver before this executor killed by kill command sent by driver. 
> At this situation, driver will accept the task finished event and ignore  
> speculative task and re-queued task. But, as we know, this executor has 
> removed by driver, the result of this finished task can not save in driver 
> because the *BlockManagerId* has also removed from *BlockManagerMaster* by 
> driver. So, the result data of this stage is not complete, and then, it will 
> cause fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> Driver will check the stage's result is not complete, and submit missing 
> task, but this time, the next stage has run because previous stage has finish 
> for its task is all finished although its result is not complete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
>

[jira] [Updated] (SPARK-9926) Parallelize file listing for partitioned Hive table

2016-04-08 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9926:
---
Assignee: Ryan Blue  (was: Cheolsoo Park)

> Parallelize file listing for partitioned Hive table
> ---
>
> Key: SPARK-9926
> URL: https://issues.apache.org/jira/browse/SPARK-9926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Ryan Blue
>
> In Spark SQL, short queries like {{select * from table limit 10}} run very 
> slowly against partitioned Hive tables because of file listing. In 
> particular, if a large number of partitions are scanned on storage like S3, 
> the queries run extremely slowly. Here are some example benchmarks in my 
> environment-
> * Parquet-backed Hive table
> * Partitioned by dateint and hour
> * Stored on S3
> ||\# of partitions||\# of files||runtime||query||
> |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 
> 10;|
> |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
> |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and 
> dateint<=20150610 limit 10;|
> The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive 
> partition path and groups them into a UnionRDD. Then, all the input files are 
> listed sequentially. In other tools such as Hive and Pig, this can be solved 
> by setting 
> [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
>  high. But in Spark, since each HadoopRDD lists only one partition path, 
> setting this property doesn't help.
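For reference, a sketch of the Hive/Pig-style workaround mentioned in the description, set from the Spark side; as explained above it does not currently help, because each HadoopRDD lists only a single partition path (the thread count is illustrative):

{code}
// In Hive/Pig, file listing can be parallelized with this Hadoop setting.
// Setting it in Spark looks like this, but it has no effect today because
// each HadoopRDD lists only one partition path before the union.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "20")
{code}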





[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231860#comment-15231860
 ] 

Nick Pentreath commented on SPARK-13944:


It seems like the current approach is to keep all the mllib code but with 
converting things back and forth. This then seems to completely duplicate the 
linalg API between {{ml}} and {{mllib}}, and anything added to {{ml}} in future 
must be added with a conversion to {{mllib}}? This seems very error-prone and 
onerous to me.

We mention that the new ML API will use the new {{ml.linalg}} vectors, which 
may be a breaking change, but it seems like we're trying not to break mllib 
APIs? To me Spark 2.0 is the only time for 2 years that we can break things, so 
is it not better to just have everything use {{ml.linalg}} and take the pain 
upfront, to make things cleaner and more maintainable going forward? 

(By the way, I've started exploring whether we can minimize the impact using 
type aliases - it seems like it might work ok in Scala, but still breaks Java. 
Also not yet sure if it solves the UDT problem at all).

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from mllib vector by Spark SQL, the vector will 
> automatically converted into the one in ml package.





[jira] [Commented] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-08 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231871#comment-15231871
 ] 

DB Tsai commented on SPARK-14408:
-

I remember that when we implemented the scaler, we had a similar discussion. We 
ended up following R's scale package, which uses the unbiased std. How about we add an 
extra flag to StandardScaler to make it biased, but default to unbiased? 

I remember that when I implemented LOR/LiR in Spark, there were packages in R 
using the unbiased std to scale the features, and most of the time, when the # of 
samples is huge, this is not a concern. So I ended up just using the 
standard scaler in mllib.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.
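A minimal sketch of the usage pattern the issue describes: both {{seqOp}} and {{combOp}} mutate their first argument and return it, which is only safe if the implementation never passes a shared zero value or a data element as that first argument (hence fold rather than reduce):

{code}
// Sketch of the MLlib-style pattern: seqOp and combOp mutate and return
// their first argument (the accumulator) to avoid per-record allocation.
val data = sc.parallelize(1 to 1000)
val stats = data.treeAggregate(Array(0.0, 0.0))(
  seqOp = (acc, x) => { acc(0) += x; acc(1) += x * x; acc },
  combOp = (a, b) => { a(0) += b(0); a(1) += b(1); a }
)
// stats(0) is the sum, stats(1) the sum of squares.
{code}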





[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-08 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231883#comment-15231883
 ] 

DB Tsai commented on SPARK-13944:
-

I'm going to close this PR and work on 
https://issues.apache.org/jira/browse/SPARK-14462 first. 

I think we will just keep the `mllib` code untouched and will not maintain it 
anymore. We'll copy the code into the `ml` package, and all further development 
will happen in the new `ml` package. As a result, the two packages will not depend on 
each other, and it will be easier to maintain.

For your second point, we'll keep all the `mllib` code untouched, but users who 
are using the new `ml` code have to migrate to the new `ml.linalg`. We did think 
about using type aliases and decided not to do it, for the same Java 
compatibility reason :)

For UDT, in ScalaReflection.scala line 700, instead of the following

{code}
val udt = Utils.classForName(className)
  .getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance()
{code}

we can add a different way to get the UDT without using the annotation. 

Thanks.

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from mllib vector by Spark SQL, the vector will 
> automatically converted into the one in ml package.





[jira] [Created] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread meiyoula (JIRA)
meiyoula created SPARK-14486:


 Summary: For partition table, the dag occurs oom because of too 
many same rdds
 Key: SPARK-14486
 URL: https://issues.apache.org/jira/browse/SPARK-14486
 Project: Spark
  Issue Type: Improvement
Reporter: meiyoula


For a partitioned table, when the per-partition RDDs go through several map operations, 
the number of RDDs grows multiplicatively, so the number of RDDs in the DAG can reach 
thousands and cause an OOM.

Can we make an improvement to reduce the number of RDDs in the DAG, e.g. show identical 
RDDs only once rather than once per partition?





[jira] [Commented] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231920#comment-15231920
 ] 

meiyoula commented on SPARK-14486:
--

[~andrewor14]

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Improvement
>Reporter: meiyoula
>
> For partition table, when partition rdds do some maps, the rdd number will 
> multiple grow. So rdd number in dag will become thounds, and occurs oom.
> Can we make a improvement to reduce the rdd number in dag. show the same rdds 
> just one time, not each partition.





[jira] [Comment Edited] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231920#comment-15231920
 ] 

meiyoula edited comment on SPARK-14486 at 4/8/16 9:25 AM:
--

[~andrewor14] Can you take a look at this? Thanks.


was (Author: meiyoula):
[~andrewor14]

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Improvement
>Reporter: meiyoula
>
> For partition table, when partition rdds do some maps, the rdd number will 
> multiple grow. So rdd number in dag will become thounds, and occurs oom.
> Can we make a improvement to reduce the rdd number in dag. show the same rdds 
> just one time, not each partition.





[jira] [Commented] (SPARK-14433) PySpark ml GaussianMixture

2016-04-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231922#comment-15231922
 ] 

Nick Pentreath commented on SPARK-14433:


Go ahead

> PySpark ml GaussianMixture
> --
>
> Key: SPARK-14433
> URL: https://issues.apache.org/jira/browse/SPARK-14433
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Python wrapper for GaussianMixture in spark.ml





[jira] [Commented] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231926#comment-15231926
 ] 

Sean Owen commented on SPARK-14486:
---

[~meiyoula] this sounds like you want to initiate a discussion. I'm not clear 
what you're concretely proposing here. JIRAs are really for when you can 
describe a specific change and reasons for it.

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Improvement
>Reporter: meiyoula
>
> For a partitioned table, when the partition RDDs go through several map
> operations, the number of RDDs multiplies, so the RDD count in the DAG can
> reach thousands and cause an OOM.
> Can we make an improvement to reduce the number of RDDs shown in the DAG,
> i.e. show identical RDDs only once rather than once per partition?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14475) Propagate user-defined context from driver to executors

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231932#comment-15231932
 ] 

Sean Owen commented on SPARK-14475:
---

How is this different from, say, broadcast variables?

> Propagate user-defined context from driver to executors
> ---
>
> Key: SPARK-14475
> URL: https://issues.apache.org/jira/browse/SPARK-14475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>
> It would be useful (e.g. for tracing) to automatically propagate arbitrary 
> user defined context (i.e. thread-locals) from the driver to executors. We 
> can do this easily by adding sc.localProperties to TaskContext.
> cc [~joshrosen]
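
As a rough sketch of what the proposed usage could look like: {{setLocalProperty}} 
already exists on the driver side, while the {{TaskContext.getLocalProperty}} 
accessor is the hypothetical addition discussed here (the property name is only an 
example).

{code}
import org.apache.spark.{SparkContext, TaskContext}

def tagJob(sc: SparkContext): Unit = {
  // Driver side: attach tracing context to the submitting thread (existing API).
  sc.setLocalProperty("trace.id", "request-42")

  sc.parallelize(1 to 4).foreach { _ =>
    // Executor side: read the propagated property inside the task.
    // getLocalProperty is the accessor proposed in this issue, not an existing API.
    val traceId = TaskContext.get().getLocalProperty("trace.id")
    println(s"running under trace $traceId")
  }
}
{code}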



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14403) the dag of a stage may has too many same chid cluster, and result to gc

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-14403:
---

I don't see that anything was resolved; at best this just isn't JIRA material at the moment

> the dag of a stage may has too many same chid cluster, and result to gc
> ---
>
> Key: SPARK-14403
> URL: https://issues.apache.org/jira/browse/SPARK-14403
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
>
> When I run a SQL query, I can't open the stage page in the web UI, and the
> HistoryServer process shuts down.
> After debugging the code, I found that a stage graph can have more than 5000
> identical child clusters, so the process goes down while generating the dot file.
> I think the graph cluster shouldn't contain duplicate child clusters, right?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14403) the dag of a stage may has too many same chid cluster, and result to gc

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14403.
---
Resolution: Invalid

> the dag of a stage may has too many same chid cluster, and result to gc
> ---
>
> Key: SPARK-14403
> URL: https://issues.apache.org/jira/browse/SPARK-14403
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
>
> When I run a SQL query, I can't open the stage page in the web UI, and the
> HistoryServer process shuts down.
> After debugging the code, I found that a stage graph can have more than 5000
> identical child clusters, so the process goes down while generating the dot file.
> I think the graph cluster shouldn't contain duplicate child clusters, right?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-14486:
-
Issue Type: Bug  (was: Improvement)

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>
> For a partitioned table, when the partition RDDs go through several map
> operations, the number of RDDs multiplies, so the RDD count in the DAG can
> reach thousands and cause an OOM.
> Can we make an improvement to reduce the number of RDDs shown in the DAG,
> i.e. show identical RDDs only once rather than once per partition?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231942#comment-15231942
 ] 

meiyoula commented on SPARK-14486:
--

I think it's a bug: when there are too many partition RDDs, the DAG can't be 
opened.

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>
> For a partitioned table, when the partition RDDs go through several map
> operations, the number of RDDs multiplies, so the RDD count in the DAG can
> reach thousands and cause an OOM.
> Can we make an improvement to reduce the number of RDDs shown in the DAG,
> i.e. show identical RDDs only once rather than once per partition?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231948#comment-15231948
 ] 

Nick Pentreath commented on SPARK-13944:


What's the reasoning behind breaking changes in {{ml}} API and not in 
{{mllib}}? It seems to me that if we're breaking one API, we may as well break 
both, and make a clean break rather than keep a bunch of essentially deprecated 
cruft around (though I guess we could deprecate in 2.0 and remove in say 2.2, 
2.3). If we broke explicitly without trying to "half-maintain" back compat, 
it's also very clear to everyone what's broken, whereas converting back and 
forth may be more error-prone in the long run.

Also, in practice the actual breaking change is mostly for (a) 3rd party 
developers developing their own models and Pipeline components; (b) users 
creating input datasets (data -> {{LabeledPoint}} or {{Vector}} in {{mllib}} 
API, or creating raw {{Vector}} DataFrame columns or working with udfs over 
{{Vector}}s in {{ml}} API). Across the board the only change required is simply 
replacing {{mllib}} with {{ml}} in the imports.

Now, if we use the type alias, Scala users don't even need to make that change! 
As for Java users, ALL of them need to change {{DataFrame}} -> {{Dataset}} 
in ALL their code. Changing imports for linalg components from {{mllib}} -> 
{{ml}} seems as onerous (or as "not very onerous").
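
For illustration, a minimal sketch of the import-level change under discussion, 
assuming the planned {{org.apache.spark.ml.linalg}} package described in this issue:

{code}
// Today: linear algebra types come from the mllib package.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
// After the split, the same user code would only swap the import:
// import org.apache.spark.ml.linalg.{Vector, Vectors}

val v: Vector = Vectors.dense(1.0, 0.0, 3.0)  // otherwise unchanged
{code}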

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from an mllib vector by Spark SQL, it will automatically
> be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14487) User Defined Type registration without SQLUserDefinedType annotation

2016-04-08 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-14487:
---

 Summary: User Defined Type registration without SQLUserDefinedType 
annotation
 Key: SPARK-14487
 URL: https://issues.apache.org/jira/browse/SPARK-14487
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh


Currently we use the SQLUserDefinedType annotation to register UDTs for user 
classes. However, doing this adds a Spark dependency to user classes.

For some user classes, it is unnecessary to add such a dependency, which 
increases deployment difficulty.

We should provide an alternative approach to register UDTs for user classes 
without the SQLUserDefinedType annotation.
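
For context, a rough sketch of what the current annotation-based registration 
looks like and the coupling it creates. The class and field names below are made 
up, and the method signatures follow the Spark 1.6 {{UserDefinedType}} API (they 
may differ on master):

{code}
import org.apache.spark.sql.types._

// The user's domain class must name a Spark class (the UDT below) in the
// annotation, which pulls the Spark SQL dependency into otherwise Spark-free
// user code -- the coupling this issue proposes to avoid.
@SQLUserDefinedType(udt = classOf[MyPointUDT])
case class MyPoint(x: Double, y: Double)

class MyPointUDT extends UserDefinedType[MyPoint] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  override def serialize(obj: Any): Any = ???          // MyPoint -> internal form (elided)
  override def deserialize(datum: Any): MyPoint = ???  // internal form -> MyPoint (elided)
  override def userClass: Class[MyPoint] = classOf[MyPoint]
}
{code}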



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232006#comment-15232006
 ] 

Sean Owen commented on SPARK-13944:
---

I can understand the idea of being able to use the simple API classes (vectors, 
models, etc) as a separable module. However, the theory behind classes like 
{{Vector}} was that they're just Spark-specific wrappers, and not something 
you'd use outside the context of a Spark app. The model classes likewise depend 
on Spark classes like {{RDD}} since they're primarily there to score 
Spark-specific representations of data.

For general interchange, I'd assume one would use general representations -- 
PMML, JSON, etc. And then for scoring not related to Spark, use libraries that 
already exist for this purpose like JPMML.

I still understand that, well, it seems funny to have this Spark 
NaiveBayesModel that can score a single input vector but tell people they can't 
use it unless they drag spark-core into their app. It does mean you're pushing 
Spark towards also becoming a general non-distributed model representation and 
scoring library. Right now, it isn't that, and taking steps to advertise its 
half-baked state as such seems problematic. Same with PMML support -- seems like 
it's better than nothing, but supporting a little PMML has turned out to be 
almost worse than not at all.

Maybe I'm over-thinking this and this is mostly about the {{.linalg}} classes? 
While it's kind of unfortunate that these internal wrapper classes have 
"leaked", that cat may be out of the bag. I recognize that's already a problem 
in that you have to depend on all of mllib to use VectorUDT for example.

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from an mllib vector by Spark SQL, it will automatically
> be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232009#comment-15232009
 ] 

Sean Owen commented on SPARK-14486:
---

You need to make the JIRA much clearer with more detail, since I didn't 
understand that this was your proposal. Why does a map make thousands of 
partitions? How would you decide how to limit the number of RDDs in the DAG?

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>
> For a partitioned table, when the partition RDDs go through several map
> operations, the number of RDDs multiplies, so the RDD count in the DAG can
> reach thousands and cause an OOM.
> Can we make an improvement to reduce the number of RDDs shown in the DAG,
> i.e. show identical RDDs only once rather than once per partition?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14487) User Defined Type registration without SQLUserDefinedType annotation

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14487:


Assignee: (was: Apache Spark)

> User Defined Type registration without SQLUserDefinedType annotation
> 
>
> Key: SPARK-14487
> URL: https://issues.apache.org/jira/browse/SPARK-14487
> Project: Spark
>  Issue Type: Improvement
>Reporter: Liang-Chi Hsieh
>
> Currently we use the SQLUserDefinedType annotation to register UDTs for user 
> classes. However, doing this adds a Spark dependency to user classes.
> For some user classes, it is unnecessary to add such a dependency, which 
> increases deployment difficulty.
> We should provide an alternative approach to register UDTs for user classes 
> without the SQLUserDefinedType annotation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14487) User Defined Type registration without SQLUserDefinedType annotation

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14487:


Assignee: Apache Spark

> User Defined Type registration without SQLUserDefinedType annotation
> 
>
> Key: SPARK-14487
> URL: https://issues.apache.org/jira/browse/SPARK-14487
> Project: Spark
>  Issue Type: Improvement
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently we use the SQLUserDefinedType annotation to register UDTs for user 
> classes. However, doing this adds a Spark dependency to user classes.
> For some user classes, it is unnecessary to add such a dependency, which 
> increases deployment difficulty.
> We should provide an alternative approach to register UDTs for user classes 
> without the SQLUserDefinedType annotation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14487) User Defined Type registration without SQLUserDefinedType annotation

2016-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232010#comment-15232010
 ] 

Apache Spark commented on SPARK-14487:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/12259

> User Defined Type registration without SQLUserDefinedType annotation
> 
>
> Key: SPARK-14487
> URL: https://issues.apache.org/jira/browse/SPARK-14487
> Project: Spark
>  Issue Type: Improvement
>Reporter: Liang-Chi Hsieh
>
> Currently we use the SQLUserDefinedType annotation to register UDTs for user 
> classes. However, doing this adds a Spark dependency to user classes.
> For some user classes, it is unnecessary to add such a dependency, which 
> increases deployment difficulty.
> We should provide an alternative approach to register UDTs for user classes 
> without the SQLUserDefinedType annotation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14488) Creating temporary table using SQL DDL shouldn't write files to file system

2016-04-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-14488:
--

 Summary: Creating temporary table using SQL DDL shouldn't write 
files to file system
 Key: SPARK-14488
 URL: https://issues.apache.org/jira/browse/SPARK-14488
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


The following Spark shell snippet shows that currently temporary table creation 
writes files to file system:

{code}
sqlContext range 10 registerTempTable "t"
sqlContext sql "create temporary table s using parquet as select * from t"
{code}

The problematic code is 
[here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table

2016-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232028#comment-15232028
 ] 

Apache Spark commented on SPARK-9926:
-

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8512

> Parallelize file listing for partitioned Hive table
> ---
>
> Key: SPARK-9926
> URL: https://issues.apache.org/jira/browse/SPARK-9926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Ryan Blue
>
> In Spark SQL, short queries like {{select * from table limit 10}} run very 
> slowly against partitioned Hive tables because of file listing. In 
> particular, if a large number of partitions are scanned on storage like S3, 
> the queries run extremely slowly. Here are some example benchmarks in my 
> environment-
> * Parquet-backed Hive table
> * Partitioned by dateint and hour
> * Stored on S3
> ||\# of partitions||\# of files||runtime||query||
> |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 
> 10;|
> |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
> |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and 
> dateint<=20150610 limit 10;|
> The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive 
> partition path and group them into a UnionRDD. Then, all the input files are 
> listed sequentially. In other tools such as Hive and Pig, this can be solved 
> by setting 
> [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
>  high. But in Spark, since each HadoopRDD lists only one partition path, 
> setting this property doesn't help.
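
For reference, a minimal sketch of the Hadoop-side knob mentioned above (assuming 
{{sc}} is the active SparkContext; the thread count is only an example). As 
described, it does not help here because each HadoopRDD lists a single partition 
path:

{code}
// Hadoop-side parallel file listing: effective in Hive/Pig, but not for the
// per-partition HadoopRDDs built by Spark's TableReader.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "20")
{code}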



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14488) Creating temporary table using SQL DDL shouldn't write files to file system

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232030#comment-15232030
 ] 

Cheng Lian edited comment on SPARK-14488 at 4/8/16 11:10 AM:
-

Oh, wait... Since there's a {{USING}} clause in the DDL statement, are we supposed 
to write the query result to disk in the given data source format, and use the 
written files to create a temporary table? So basically this DDL is used to save a 
query result to disk in a specific data source format? I find this one quite 
confusing...

cc [~yhuai] [~marmbrus]


was (Author: lian cheng):
Oh, wait... Since there's a {{USING}} in the DDL statement, are we supposed to 
write the query result using given data source format on disk, and use written 
files to create a temporary table? So basically this DDL is used to save a 
query result using a specific data source format to disk? I find this one quite 
confusing...

> Creating temporary table using SQL DDL shouldn't write files to file system
> ---
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet shows that currently temporary table 
> creation writes files to file system:
> {code}
> sqlContext range 10 registerTempTable "t"
> sqlContext sql "create temporary table s using parquet as select * from t"
> {code}
> The problematic code is 
> [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) Creating temporary table using SQL DDL shouldn't write files to file system

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232030#comment-15232030
 ] 

Cheng Lian commented on SPARK-14488:


Oh, wait... Since there's a {{USING}} clause in the DDL statement, are we supposed 
to write the query result to disk in the given data source format, and use the 
written files to create a temporary table? So basically this DDL is used to save a 
query result to disk in a specific data source format? I find this one quite 
confusing...

> Creating temporary table using SQL DDL shouldn't write files to file system
> ---
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet shows that currently temporary table 
> creation writes files to file system:
> {code}
> sqlContext range 10 registerTempTable "t"
> sqlContext sql "create temporary table s using parquet as select * from t"
> {code}
> The problematic code is 
> [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231948#comment-15231948
 ] 

Nick Pentreath edited comment on SPARK-13944 at 4/8/16 11:09 AM:
-

What's the reasoning behind breaking changes in {{ml}} API and not in 
{{mllib}}? It seems to me that if we're breaking one API, we may as well break 
both, and make a clean break rather than keep a bunch of essentially deprecated 
cruft around (though I guess we could deprecate in 2.0 and remove in say 2.2, 
2.3). If we broke explicitly without trying to "half-maintain" back compat, 
it's also very clear to everyone what's broken. While converting back and forth 
may be more error prone in the long run.

Also, in practice the actual breaking change is mostly for (a) 3rd party 
developers developing their own models and Pipeline components; (b) users 
creating input datasets (data -> {{LabeledPoint}} or {{Vector}} in {{mllib}} 
API, or creating raw {{Vector}} DataFrame columns or working with udfs over 
{{Vector}} in {{ml}} API). Across the board the only change required is simply 
replacing {{mllib}} with {{ml}} in the imports.

Now, if we use the type alias, Scala users don't even need to make that change! 
As for Java users, ALL of them need to change {{DataFrame}} -> {{Dataset}} 
in ALL their code. Changing imports for linalg components from {{mllib}} -> 
{{ml}} seems as onerous (or as "not very onerous").


was (Author: mlnick):
What's the reasoning behind breaking changes in {{ml}} API and not in 
{{mllib}}? It seems to me that if we're breaking one API, we may as well break 
both, and make a clean break rather than keep a bunch of essentially deprecated 
cruft around (though I guess we could deprecate in 2.0 and remove in say 2.2, 
2.3). If we broke explicitly without trying to "half-maintain" back compat, 
it's also very clear to everyone what's broken. While converting back and forth 
may be more error prone in the long run.

Also, in practice the actual breaking change is mostly for (a) 3rd party 
developers developing their own models and Pipeline components; (b) users 
creating input datasets (data -> {{LabeledPoint}} or {{Vector}} in {{mllib}} 
API, or creating raw {{Vector}} DataFrame columns or working with udfs over 
{{Vector}}s in {{ml}} API). Across the board the only change required is simply 
replacing {{mllib}} with {{ml}} in the imports.

Now, if we use the type alias, Scala users don't even need to make that change! 
As for Java users, ALL of them need to change {{DataFrame}} -> {{Dataset}} 
in ALL their code. Changing imports for linalg components from {{mllib}} -> 
{{ml}} seems as onerous (or as "not very onerous").

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from an mllib vector by Spark SQL, it will automatically
> be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10340.
---
Resolution: Duplicate

Given the last comment let's fold this into SPARK-9926

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>
> AWS S3 provides bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose prefixes start with 
> the common prefix in blocks of 1000.
> Since SPARK-9926 allow us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For eg, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread JIRA
Boris Clémençon  created SPARK-14489:


 Summary: RegressionEvaluator returns NaN for ALS in Spark ml
 Key: SPARK-14489
 URL: https://issues.apache.org/jira/browse/SPARK-14489
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.0
 Environment: AWS EMR
Reporter: Boris Clémençon 


When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing from the training set, 
hence producing a few NaN estimates from the transform method, and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., removing users or items in the validation set that are 
missing from the training set). Log a message when this happens.
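
A rough sketch of the suggested workaround, assuming the usual {{rating}} label 
column and {{prediction}} output column of an ALS pipeline (both column names are 
assumptions):

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame

// predictions: the output of model.transform(validationDataset)
def rmseIgnoringNaN(predictions: DataFrame): Double = {
  // Drop rows whose user/item was unseen during training and therefore got a
  // NaN prediction, and report how many rows were dropped.
  val cleaned = predictions.filter(!predictions("prediction").isNaN)
  val dropped = predictions.count() - cleaned.count()
  if (dropped > 0) println(s"Dropped $dropped NaN predictions before computing RMSE")

  new RegressionEvaluator()
    .setMetricName("rmse")
    .setLabelCol("rating")
    .setPredictionCol("prediction")
    .evaluate(cleaned)
}
{code}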



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-14489:
-
Description: 
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing from the training set, 
hence producing a few NaN estimates from the transform method, and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., removing users or items in the validation set that are 
missing from the training set). Log a message when this happens.

val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
    // TODO: duplicate evaluator to take extra params from input
    val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
    metrics(i) += metric
    i += 1
  }
  validationDataset.unpersist()
}

  was:
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing in the training set, 
hence generating a few NaN estimation with transform method and NaN 
RegressionEvaluator's metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the rmse or 
other metrics (ie, removing users or items in validation test that is missing 
in the learning set). Send logs when this happen.


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14490) spark.yarn.am.attemptFailuresValidityInterval doesn't work

2016-04-08 Thread Fu Chen (JIRA)
Fu Chen created SPARK-14490:
---

 Summary: spark.yarn.am.attemptFailuresValidityInterval doesn't work
 Key: SPARK-14490
 URL: https://issues.apache.org/jira/browse/SPARK-14490
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.0
Reporter: Fu Chen


We want YARN to supervise our Spark Streaming process, so we set the 
spark.yarn.am.attemptFailuresValidityInterval configuration, but it does not 
work for us.
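
For reference, a minimal sketch of how this setting is usually supplied (the 
values are examples only; the underlying YARN feature requires Hadoop 2.6+):

{code}
import org.apache.spark.SparkConf

// Allow several AM attempts, but only count failures that happened within the
// last hour toward that limit.
val conf = new SparkConf()
  .setAppName("long-running-streaming-app")  // hypothetical app name
  .set("spark.yarn.maxAppAttempts", "4")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
{code}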



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-14489:
-
Description: 
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing from the training set, 
hence producing a few NaN estimates from the transform method, and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., removing users or items in the validation set that are 
missing from the training set). Log a message when this happens.

Issue SPARK-14153 seems to be the same problem.

val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
    // TODO: duplicate evaluator to take extra params from input
    val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
    metrics(i) += metric
    i += 1
  }
  validationDataset.unpersist()
}

  was:
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing in the training set, 
hence generating a few NaN estimation with transform method and NaN 
RegressionEvaluator's metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the rmse or 
other metrics (ie, removing users or items in validation test that is missing 
in the learning set). Send logs when this happen.

val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
// TODO: duplicate evaluator to take extra params from input
val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
  }
  validationDataset.unpersist()
}


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to the same pbm
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters

[jira] [Commented] (SPARK-14153) My dataset does not provide proper predictions in ALS

2016-04-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232057#comment-15232057
 ] 

Boris Clémençon  commented on SPARK-14153:
--

Hi.

SPARK-14489 may be related to this issue. In short, spark.ml performs K-fold 
cross-validation, training on some folds and evaluating against the held-out 
fold. When the dataset is sparse, there is a good chance that a user in the 
validation set is missing from the training set, hence producing NaN when 
transform is applied (and a fortiori NaN metrics).

Hope this helps,
B

> My dataset does not provide proper predictions in ALS
> -
>
> Key: SPARK-14153
> URL: https://issues.apache.org/jira/browse/SPARK-14153
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> When I use the data set from the GitHub example, I get proper predictions. But 
> when I use my own data set it does not predict well (it has a large RMSE). 
> I used cross validator for ALS  (in Spark ML) and here are the best model 
> parameters.
> 16/03/25 12:03:06 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN)
> 16/03/25 12:03:06 INFO CrossValidator: Best set of parameters:
> {
>   als_c911c0e183a3-alpha: 0.02,
>   als_c911c0e183a3-rank: 500,
>   als_c911c0e183a3-regParam: 0.03
> }
> But when I used movie data set It gives proper values for parameters. as below
> 16/03/24 14:07:07 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(1.9481584447713676, 2.0501457159728944, 2.0600857505406935, 
> 1.9457234533860048, 2.0494498583414282, 2.0595306613827002, 
> 1.9488322049918922, 2.0489573853226797, 2.0584252131752, 1.9464006741621391, 
> 2.048241271354197, 2.057853990227443)
> 16/03/24 14:07:07 INFO CrossValidator: Best set of parameters:
> {
>   als_31a605e7717b-alpha: 0.02,
>   als_31a605e7717b-rank: 1,
>   als_31a605e7717b-regParam: 0.02
> }
> 16/03/24 14:07:07 INFO CrossValidator: Best cross-validation metric: 
> 1.9457234533860048.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14153) My dataset does not provide proper predictions in ALS

2016-04-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232057#comment-15232057
 ] 

Boris Clémençon  edited comment on SPARK-14153 at 4/8/16 11:43 AM:
---

Hi.

SPARK-14489 may be related to this issue. In short, spark.ml performs K-fold 
cross-validation, training on some folds and evaluating against the held-out 
fold. When the dataset is sparse, there is a good chance that a user in the 
validation set is missing from the training set, hence producing NaN when 
transform is applied (and a fortiori NaN metrics).

Hope this helps,
B


was (Author: clemencb):
Hi.

SPARK-14489 is maybe related to this issue. In short, Spark-ml is making 
K-Folds to alternatively train one against the other. When the dataset is 
sparse, there is a great chance that a user in the validation set is missing in 
the training set, hence producing NaN while transform is applied (and a 
fortiori NaN metrics)

Hope this help
B

> My dataset does not provide proper predictions in ALS
> -
>
> Key: SPARK-14153
> URL: https://issues.apache.org/jira/browse/SPARK-14153
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> When I use the data set from the GitHub example, I get proper predictions. But 
> when I use my own data set it does not predict well (it has a large RMSE). 
> I used cross validator for ALS  (in Spark ML) and here are the best model 
> parameters.
> 16/03/25 12:03:06 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN)
> 16/03/25 12:03:06 INFO CrossValidator: Best set of parameters:
> {
>   als_c911c0e183a3-alpha: 0.02,
>   als_c911c0e183a3-rank: 500,
>   als_c911c0e183a3-regParam: 0.03
> }
> But when I used movie data set It gives proper values for parameters. as below
> 16/03/24 14:07:07 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(1.9481584447713676, 2.0501457159728944, 2.0600857505406935, 
> 1.9457234533860048, 2.0494498583414282, 2.0595306613827002, 
> 1.9488322049918922, 2.0489573853226797, 2.0584252131752, 1.9464006741621391, 
> 2.048241271354197, 2.057853990227443)
> 16/03/24 14:07:07 INFO CrossValidator: Best set of parameters:
> {
>   als_31a605e7717b-alpha: 0.02,
>   als_31a605e7717b-rank: 1,
>   als_31a605e7717b-regParam: 0.02
> }
> 16/03/24 14:07:07 INFO CrossValidator: Best cross-validation metric: 
> 1.9457234533860048.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14470.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12245
[https://github.com/apache/spark/pull/12245]

> Allow for overriding both httpclient and httpcore versions
> --
>
> Key: SPARK-14470
> URL: https://issues.apache.org/jira/browse/SPARK-14470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Aaron Tokhy
>Priority: Minor
> Fix For: 2.0.0
>
>
> The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
> 'httpcore' versions are the same.  This restriction isn't necessarily true, 
> as you could potentially have an httpclient version of 4.3.6 and an httpcore 
> version of 4.3.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Summary: Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS 
SELECT ..."  (was: Creating temporary table using SQL DDL shouldn't write files 
to file system)

> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet shows that currently temporary table 
> creation writes files to file system:
> {code}
> sqlContext range 10 registerTempTable "t"
> sqlContext sql "create temporary table s using parquet as select * from t"
> {code}
> The problematic code is 
> [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14470:
--
Assignee: Aaron Tokhy
Target Version/s: 2.0.0  (was: 1.6.2, 2.0.0)
Priority: Trivial  (was: Minor)

> Allow for overriding both httpclient and httpcore versions
> --
>
> Key: SPARK-14470
> URL: https://issues.apache.org/jira/browse/SPARK-14470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Aaron Tokhy
>Assignee: Aaron Tokhy
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
> 'httpcore' versions are the same.  This restriction isn't necessarily true, 
> as you could potentially have an httpclient version of 4.3.6 and an httpcore 
> version of 4.3.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14153) My dataset does not provide proper predictions in ALS

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14153.
---
Resolution: Duplicate

> My dataset does not provide proper predictions in ALS
> -
>
> Key: SPARK-14153
> URL: https://issues.apache.org/jira/browse/SPARK-14153
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> When I use the data set from the GitHub example, I get proper predictions. But 
> when I use my own data set it does not predict well (it has a large RMSE). 
> I used cross validator for ALS  (in Spark ML) and here are the best model 
> parameters.
> 16/03/25 12:03:06 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN)
> 16/03/25 12:03:06 INFO CrossValidator: Best set of parameters:
> {
>   als_c911c0e183a3-alpha: 0.02,
>   als_c911c0e183a3-rank: 500,
>   als_c911c0e183a3-regParam: 0.03
> }
> But when I used movie data set It gives proper values for parameters. as below
> 16/03/24 14:07:07 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(1.9481584447713676, 2.0501457159728944, 2.0600857505406935, 
> 1.9457234533860048, 2.0494498583414282, 2.0595306613827002, 
> 1.9488322049918922, 2.0489573853226797, 2.0584252131752, 1.9464006741621391, 
> 2.048241271354197, 2.057853990227443)
> 16/03/24 14:07:07 INFO CrossValidator: Best set of parameters:
> {
>   als_31a605e7717b-alpha: 0.02,
>   als_31a605e7717b-rank: 1,
>   als_31a605e7717b-regParam: 0.02
> }
> 16/03/24 14:07:07 INFO CrossValidator: Best cross-validation metric: 
> 1.9457234533860048.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Description: 
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under the default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one?
# If it is, what is the expected semantics?


  was:
The following Spark shell snippet shows that currently temporary table creation 
writes files to file system:

{code}
sqlContext range 10 registerTempTable "t"
sqlContext sql "create temporary table s using parquet as select * from t"
{code}

The problematic code is 
[here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137].


> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under the default Hive warehouse location, which is {{/user/hive/warehouse/y}} 
> on my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING <format>}} clause, which is supposed to, I guess, 
> convert the result of the above query into the given format. And by 
> "converting", we have to write the data out to the file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14490) spark.yarn.am.attemptFailuresValidityInterval doesn't work

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232073#comment-15232073
 ] 

Sean Owen commented on SPARK-14490:
---

"Not work" is far from sufficient. I will close this unless you can provide a 
lot more detail

> spark.yarn.am.attemptFailuresValidityInterval doesn't work
> --
>
> Key: SPARK-14490
> URL: https://issues.apache.org/jira/browse/SPARK-14490
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Fu Chen
>
> We want YARN to supervise our Spark Streaming process, so we set the config 
> spark.yarn.am.attemptFailuresValidityInterval, but it does not work for us.
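
For reference, a minimal sketch of how this setting is typically supplied (the 
reporter's actual configuration is not shown in the ticket; the values below are 
illustrative assumptions):

{code}
import org.apache.spark.SparkConf

// Illustrative values only: allow up to 4 AM attempts, and forget AM failures
// that are older than one hour. Per the Spark YARN docs, the validity interval
// is only honored on clusters running Hadoop 2.6+.
val conf = new SparkConf()
  .set("spark.yarn.maxAppAttempts", "4")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
{code}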



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Description: 
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one?
# If it is, what is the expected semantics?


  was:
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one?
# If it is, what is the expected semantics?



> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data

[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232078#comment-15232078
 ] 

Cheng Lian commented on SPARK-14488:


Updated title and description of this ticket.

> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING <format>}} clause, which is supposed to, I guess, 
> convert the result of the above query into the given format. And by 
> "converting", we have to write the data out to the file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14402.
---
Resolution: Fixed

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Currently, Spark SQL's `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters, e.g. *sParK* --> *SParK*. (This is the correct 
> expected implementation of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.
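
For illustration, a minimal sketch of the difference between the two behaviors 
(an editorial example; the input string is an assumption, not taken from the 
ticket):

{code}
// Hive/Oracle-style initcap capitalizes the first letter of each word and
// lowercases the rest; the old toTitleCase-based behavior left the remaining
// letters untouched.
sqlContext.sql("SELECT initcap('sParK sql')").show()
// Hive/Oracle-style result:  Spark Sql
// previous Spark behavior:   SParK Sql
{code}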



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232092#comment-15232092
 ] 

Cheng Lian commented on SPARK-14488:


Tried the same snippet using Spark 1.6, and got the following exception, which 
makes sense:
{noformat}
scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM 
x"
java.util.NoSuchElementException: key not found: path
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:150)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:150)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:230)
at 
org.apache.spark.sql.execution.datasources.CreateTempTableUsingAsSelect.run(ddl.scala:112)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $iwC$$iwC$$iwC$$iwC.(:37)
at $iwC$$iwC$$iwC.(:39)
at $iwC$$iwC.(:41)
at $iwC.(:43)
at (:45)
at .(:49)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMai

[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232107#comment-15232107
 ] 

Wenchen Fan commented on SPARK-14488:
-

This command looks very weird to me. Does it make sense? Should we remove it?
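
For context, the effect a user presumably wants from a temporary table here is 
already available without this DDL. A minimal sketch (an editorial illustration, 
not a proposal from this thread):

{code}
// Register the query result as an in-memory temporary table; nothing is
// written to the Hive warehouse, and the table shows up with isTemporary = true.
sqlContext range 10 registerTempTable "x"
sqlContext.sql("SELECT * FROM x").registerTempTable("y")
sqlContext.tables().show()
{code}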

> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING <format>}} clause, which is supposed to, I guess, 
> convert the result of the above query into the given format. And by 
> "converting", we have to write the data out to the file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232111#comment-15232111
 ] 

Cheng Lian commented on SPARK-14488:


However, if {{TEMPORARY + USING + AS SELECT}} is an invalid combination, why do 
we have a [{{CreateTempTableUsingAsSelect}} 
command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
 which exactly maps to this combination?

> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING <format>}} clause, which is supposed to, I guess, 
> convert the result of the above query into the given format. And by 
> "converting", we have to write the data out to the file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Description: 
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one? If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
 which exactly maps to this combination?
# If it is, what is the expected semantics?


  was:
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one?
# If it is, what is the expected semantics?



> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/

[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232118#comment-15232118
 ] 

Cheng Lian commented on SPARK-14488:


Result of {{EXPLAIN EXTENDED CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * 
FROM x}}:

{noformat}
== Parsed Logical Plan ==
'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- 'Project [*]
   +- 'UnresolvedRelation `x`, None

== Analyzed Logical Plan ==

CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Project [id#0L]
   +- SubqueryAlias x
  +- Range 0, 10, 1, 1, [id#0L]

== Optimized Logical Plan ==
CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Range 0, 10, 1, 1, [id#0L]

== Physical Plan ==
ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
[Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
[id#0L]|
{noformat}

So it seems that the parser drops {{TEMPORARY}}.

> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING <format>}} clause, which is supposed to, I guess, 
> convert the result of the above query into the given format. And by 
> "converting", we have to write the data out to the file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one? If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
> command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
>  which exactly maps to this combination?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Description: 
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one?
# If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
 which exactly maps to this combination?
# If it is, what is the expected semantics?


  was:
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has an {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING <format>}} clause, which is supposed to, I guess, convert 
the result of the above query into the given format. And by "converting", we 
have to write the data out to the file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one? If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
 which exactly maps to this combination?
# If it is, what is the expected semantics?



> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x| 

[jira] [Comment Edited] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232092#comment-15232092
 ] 

Cheng Lian edited comment on SPARK-14488 at 4/8/16 12:27 PM:
-

Tried the same snippet using Spark 1.6, and got the following exception, which 
makes sense. I tend to believe that the combination described in the ticket is 
invalid and should be rejected by either the parser or the analyzer.

{noformat}
scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM 
x"
java.util.NoSuchElementException: key not found: path
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:150)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:150)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:230)
at 
org.apache.spark.sql.execution.datasources.CreateTempTableUsingAsSelect.run(ddl.scala:112)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $iwC$$iwC$$iwC$$iwC.(:37)
at $iwC$$iwC$$iwC.(:39)
at $iwC$$iwC.(:41)
at $iwC.(:43)
at (:45)
at .(:49)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe

[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-08 Thread Steve Johnston (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232187#comment-15232187
 ] 

Steve Johnston commented on SPARK-14389:


I'll verify as soon as EMR goes to 6.1.

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, 
> sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-14489:
-
Description: 
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user in the validation set is missing from the training set, 
hence generating a few NaN estimates from the transform method and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., remove users or items in the validation set that are 
missing from the training set), and log a warning when this happens.

Issue SPARK-14153 seems to be the same problem.
{code:title=Bar.scala|borderStyle=solid}
val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
// TODO: duplicate evaluator to take extra params from input
val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
  }
  validationDataset.unpersist()
}
{code}
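
Below is a minimal sketch of the suggestion above, slotted into the loop from 
the snippet (an editorial illustration, not the actual patch; it assumes the 
default spark.ml prediction column name, "prediction"):

{code}
import org.apache.spark.sql.functions.{col, isnan}

val predictions = models(i).transform(validationDataset, epm(i))
// Drop rows whose prediction is NaN (e.g. users/items unseen during training)
// and report how many were discarded before computing the metric.
// logWarning comes from the Logging trait mixed into CrossValidator.
val cleaned = predictions.filter(!isnan(col("prediction")))
val dropped = predictions.count() - cleaned.count()
if (dropped > 0) logWarning(s"Dropped $dropped NaN predictions before evaluation.")
val metric = eval.evaluate(cleaned)
{code}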

  was:
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user in the validation set is missing from the training set, 
hence generating a few NaN estimates from the transform method and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., remove users or items in the validation set that are 
missing from the training set), and log a warning when this happens.

Issue SPARK-14153 seems to be the same problem.

val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
// TODO: duplicate evaluator to take extra params from input
val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
  }
  validationDataset.unpersist()
}


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training set, 
> hence generating a few NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validat

[jira] [Updated] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-14489:
-
Description: 
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user in the validation set is missing from the training set, 
hence generating a few NaN estimates from the transform method and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., remove users or items in the validation set that are 
missing from the training set), and log a warning when this happens.

Issue SPARK-14153 seems to be the same problem.
{code:title=Bar.scala|borderStyle=solid}
val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
// TODO: duplicate evaluator to take extra params from input
val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
  }
  validationDataset.unpersist()
}
{code}

  was:
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user in the validation set is missing from the training set, 
hence generating a few NaN estimates from the transform method and NaN 
RegressionEvaluator metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
other metrics (i.e., remove users or items in the validation set that are 
missing from the training set), and log a warning when this happens.

Issue SPARK-14153 seems to be the same problem.
{code:title=Bar.scala|borderStyle=solid}
val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // multi-model training
  logDebug(s"Train split $splitIndex with multiple sets of parameters.")
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
// TODO: duplicate evaluator to take extra params from input
val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
  }
  validationDataset.unpersist()
}
{code}


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training set, 
> hence generating a few NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   v

[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232226#comment-15232226
 ] 

Nick Pentreath commented on SPARK-14489:


This issue would also apply to any ranking-based evaluator for ALS. [~srowen] 
what do you think? A param flag to allow excluding NaNs (with a warning should 
they be encountered) for {{RegressionEvaluator}}?

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training set, 
> hence generating a few NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232241#comment-15232241
 ] 

Sean Owen commented on SPARK-14489:
---

You could also argue that the problem is returning NaN for an unknown user when 
making recommendations. I'm not sure where that comes in, but a reasonable way 
to address this would be to always provide a value: an empty list of 
recommendations, an estimated strength of 0 for the implicit case, or some kind 
of global average rating for the explicit case.

If the model is allowed to return NaN, and those NaNs are ignored by the 
evaluation metric, it is essentially not penalized for 'passing' on a question. 
Making the model return an answer in all cases seems more useful and requires 
no workaround or flag.

 [~dulajrajitha] [~clemencb] what do you think of trying to implement that?
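
A rough sketch of the evaluation-side version of that idea for the 
explicit-rating case (an editorial illustration only; {{model}}, {{evaluator}} 
and the column names are assumptions, and as noted the cleaner fix would live 
inside ALS itself):

{code}
import org.apache.spark.sql.functions.{avg, col}

// Substitute a global average rating for NaN predictions, so the model is still
// penalized for users/items it cannot score instead of silently skipping them.
val globalAvg = trainingDataset.agg(avg(col("rating"))).first().getDouble(0)
val filled = model.transform(validationDataset).na.fill(globalAvg, Seq("prediction"))
val metric = evaluator.evaluate(filled)
{code}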

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala, line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training set, 
> hence generating a few NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14491) refactor object operator framework to make it easy to eliminate serializations

2016-04-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-14491:
---

 Summary: refactor object operator framework to make it easy to 
eliminate serializations
 Key: SPARK-14491
 URL: https://issues.apache.org/jira/browse/SPARK-14491
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-08 Thread Steve Johnston (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Johnston updated SPARK-14389:
---
Comment: was deleted

(was: I'll verify as soon as EMR goes to 6.1.)

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, 
> sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14491) refactor object operator framework to make it easy to eliminate serializations

2016-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232262#comment-15232262
 ] 

Apache Spark commented on SPARK-14491:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12260

> refactor object operator framework to make it easy to eliminate serializations
> --
>
> Key: SPARK-14491
> URL: https://issues.apache.org/jira/browse/SPARK-14491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14491) refactor object operator framework to make it easy to eliminate serializations

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14491:


Assignee: Wenchen Fan  (was: Apache Spark)

> refactor object operator framework to make it easy to eliminate serializations
> --
>
> Key: SPARK-14491
> URL: https://issues.apache.org/jira/browse/SPARK-14491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14491) refactor object operator framework to make it easy to eliminate serializations

2016-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14491:


Assignee: Apache Spark  (was: Wenchen Fan)

> refactor object operator framework to make it easy to eliminate serializations
> --
>
> Key: SPARK-14491
> URL: https://issues.apache.org/jira/browse/SPARK-14491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."

2016-04-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232277#comment-15232277
 ] 

Herman van Hovell commented on SPARK-14488:
---

[~lian cheng] I just did the following:
{noformat}
scala> import org.apache.spark.sql.execution.SparkSqlParser._

scala> parsePlan("CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x")
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@2f98859a, 
None, Overwrite, Map()
+- 'Project [*]
   +- 'UnresolvedRelation `x`, None

scala> 
res0.asInstanceOf[org.apache.spark.sql.execution.datasources.CreateTableUsingAsSelect].temporary
res1: Boolean = true
{noformat}

The temporary variable seems to be set.

> Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
> --
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has a {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING }} clause, which is supposed to, I guess, 
> converting the result of the above query into the given format. And by 
> "converting", we have to write out the data into file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one?
> # If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
> command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
>  which exactly maps to this combination?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-04-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-14279:
--
Description: 

Right now the spark-submit --version and other parts of the code pick up 
version information from a static SPARK_VERSION. We would want to pick the 
version from the pom.version, probably stored inside a properties file. Also, it 
might be nice to have other details like branch, build information and other 
specific details in the spark-submit --version output.

The motivation is to more easily tie this to automated continuous integration 
and deployment and to easily have traceability.

Part of this is that right now you have to manually change a Java file to change 
the version that comes out when you run spark-submit --version. With continuous 
integration the build numbers could be something like 1.6.1.X (where X 
increments on each change) and I want to see the exact version easily. Having to 
manually change a Java file makes that hard. Obviously that should also make the 
Apache Spark releases easier, as you don't have to manually change this file as 
well.

The other important part for me is the git information. This easily lets me 
trace it back to exact commits. We have a multi-tenant YARN cluster and users 
can run many different versions at once. I want to be able to see exactly which 
version they are running. The reason to know the exact version can range from 
helping debug some problem to making sure someone didn't hack something in 
Spark to cause bad things (generally they should use an approved version), etc.



  was:Right now the spark-submit --version and other parts of the code pick up 
version information from a static SPARK_VERSION. We would want to  pick the 
version from the pom.version probably stored inside a properties file. Also, it 
might be nice to have other details like branch, build information and other 
specific details when having a spark-submit --version


> Improve the spark build to pick the version information from the pom file 
> instead of package.scala
> --
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now the spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from the pom.version, probably stored inside a properties file. Also, 
> it might be nice to have other details like branch, build information and 
> other specific details in the spark-submit --version output.
> The motivation is to more easily tie this to automated continuous integration 
> and deployment and to easily have traceability.
> Part of this is that right now you have to manually change a Java file to 
> change the version that comes out when you run spark-submit --version. With 
> continuous integration the build numbers could be something like 1.6.1.X 
> (where X increments on each change) and I want to see the exact version 
> easily. Having to manually change a Java file makes that hard. Obviously that 
> should also make the Apache Spark releases easier, as you don't have to 
> manually change this file as well.
> The other important part for me is the git information. This easily lets me 
> trace it back to exact commits. We have a multi-tenant YARN cluster and users 
> can run many different versions at once. I want to be able to see exactly 
> which version they are running. The reason to know the exact version can range 
> from helping debug some problem to making sure someone didn't hack something 
> in Spark to cause bad things (generally they should use an approved version), 
> etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file and add git commit information

2016-04-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-14279:
--
Summary: Improve the spark build to pick the version information from the 
pom file and add git commit information  (was: Improve the spark build to pick 
the version information from the pom file instead of package.scala)

> Improve the spark build to pick the version information from the pom file and 
> add git commit information
> 
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now the spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from the pom.version, probably stored inside a properties file. Also, 
> it might be nice to have other details like branch, build information and 
> other specific details in the spark-submit --version output.
> The motivation is to more easily tie this to automated continuous integration 
> and deployment and to easily have traceability.
> Part of this is that right now you have to manually change a Java file to 
> change the version that comes out when you run spark-submit --version. With 
> continuous integration the build numbers could be something like 1.6.1.X 
> (where X increments on each change) and I want to see the exact version 
> easily. Having to manually change a Java file makes that hard. Obviously that 
> should also make the Apache Spark releases easier, as you don't have to 
> manually change this file as well.
> The other important part for me is the git information. This easily lets me 
> trace it back to exact commits. We have a multi-tenant YARN cluster and users 
> can run many different versions at once. I want to be able to see exactly 
> which version they are running. The reason to know the exact version can range 
> from helping debug some problem to making sure someone didn't hack something 
> in Spark to cause bad things (generally they should use an approved version), 
> etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file and add git commit information

2016-04-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-14279:
--
Description: 
Right now the spark-submit --version and other parts of the code pick up 
version information from a static SPARK_VERSION. We would want to pick the 
version from the pom.version, probably stored inside a properties file. Also, it 
might be nice to have other details like branch, build information and other 
specific details in the spark-submit --version output.

Note: the motivation is to more easily tie this to automated continuous 
integration and deployment and to easily have traceability.

Part of this is that right now you have to manually change a Java file to change 
the version that comes out when you run spark-submit --version. With continuous 
integration the build numbers could be something like 1.6.1.X (where X 
increments on each change) and I want to see the exact version easily. Having to 
manually change a Java file makes that hard. Obviously that should also make the 
Apache Spark releases easier, as you don't have to manually change this file as 
well.

The other important part for me is the git information. This easily lets me 
trace it back to exact commits. We have a multi-tenant YARN cluster and users 
can run many different versions at once. I want to be able to see exactly which 
version they are running. The reason to know the exact version can range from 
helping debug some problem to making sure someone didn't hack something in 
Spark to cause bad things (generally they should use an approved version), etc.



  was:

Right now the spark-submit --version and other parts of the code pick up 
version information from a static SPARK_VERSION. We would want to  pick the 
version from the pom.version probably stored inside a properties file. Also, it 
might be nice to have other details like branch, build information and other 
specific details when having a spark-submit --version

the motivation is to more easily tie this to automated continuous integration 
and deployment and to easily have traceability.

Part of this is right now you have to manually change a java file to change the 
version that comes out when you run spark-submit --version. With continuous 
integration the build numbers could be something like 1.6.1.X (where X 
increments on each change) and I want to see the exact version easily. Having 
to manually change a java file makes that hard. obviously that should make the 
apache spark releases easier as you don't have to manually change this file as 
well.

The other important part for me is the git information. This easily lets me 
trace it back to exact commits. We have a multi-tenant YARN cluster and users 
can run many different versions at once. I want to be able to see exactly which 
version they are running. The reason to know exact version can range from 
helping debug some problem to making sure someone didn't hack something in 
Spark to cause bad things (generally they should use approved version), etc.




> Improve the spark build to pick the version information from the pom file and 
> add git commit information
> 
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now the spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from the pom.version, probably stored inside a properties file. Also, 
> it might be nice to have other details like branch, build information and 
> other specific details in the spark-submit --version output.
> Note: the motivation is to more easily tie this to automated continuous 
> integration and deployment and to easily have traceability.
> Part of this is that right now you have to manually change a Java file to 
> change the version that comes out when you run spark-submit --version. With 
> continuous integration the build numbers could be something like 1.6.1.X 
> (where X increments on each change) and I want to see the exact version 
> easily. Having to manually change a Java file makes that hard. Obviously that 
> should also make the Apache Spark releases easier, as you don't have to 
> manually change this file as well.
> The other important part for me is the git information. This easily lets me 
> trace it back to exact commits. We have a multi-tenant YARN cluster and users 
> can run many different versions at once. I want to be able to see exactly 
> which version they are running. The reason to know the exact version can range 
> from helping debug some problem to making sure someone didn't hack something 
> in Spark to cause bad things (generally they should use an approved version), 
> etc.
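
A minimal sketch of the properties-file idea described in this issue, assuming a 
hypothetical {{spark-version-info.properties}} resource generated by the build 
(the resource name and keys are illustrative assumptions, not an existing Spark 
file):

{code}
import java.util.Properties

// Read build metadata (version, branch, git revision) from a properties file
// generated at build time instead of a hard-coded SPARK_VERSION constant.
// The resource name and property keys below are illustrative assumptions.
val props = new Properties()
val in = getClass.getResourceAsStream("/spark-version-info.properties")
if (in != null) {
  try props.load(in) finally in.close()
}

val version  = props.getProperty("version", "unknown")
val branch   = props.getProperty("branch", "unknown")
val revision = props.getProperty("revision", "unknown")
println(s"Spark $version (branch $branch, revision $revision)")
{code}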

[jira] [Created] (SPARK-14492) Spark 1.6.0 cannot work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2016-04-08 Thread Sunil Rangwani (JIRA)
Sunil Rangwani created SPARK-14492:
--

 Summary: Spark 1.6.0 cannot work with Hive version lower than 
1.2.0; its not backwards compatible with earlier version
 Key: SPARK-14492
 URL: https://issues.apache.org/jira/browse/SPARK-14492
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Sunil Rangwani
Priority: Critical


Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
because this field was introduced in Hive 1.2.0, so it's not possible to use a 
Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 


Exception in thread "main" java.lang.NoSuchFieldError: 
METASTORE_CLIENT_SOCKET_LIFETIME
at 
org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232297#comment-15232297
 ] 

Boris Clémençon  commented on SPARK-14489:
--

Hi all, 

Perhaps having the transform() method return NaN or some other particular 
symbol (NA in a DataFrame?) to deal with users who are not in the training set 
is not a bad idea. It makes it clear that the method cannot deal with these 
users (so far), which is good. Using multiple methods to compute the score 
might seem a bit confusing (like imputing the average, 0 or any fixed value), 
and it could also introduce bias in some metrics like MSE. Making a fix in 
RegressionEvaluator (removing NaN) and logging a warning seems a better option 
to me. 

Besides, I think it is theoretically possible to evaluate the scores of users 
or items that were not in the training set, just as it is possible to add new 
users to a precomputed PCA system of axes. 

I am just a simple Spark user (and fan), but I could try to clone the project 
and push the modification if required.
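
For illustration, a minimal sketch of the evaluator-side fix suggested above, 
assuming a hypothetical {{predictions}} DataFrame produced by 
{{ALSModel.transform()}} with "rating" and "prediction" columns; rows whose 
prediction is NaN are dropped before the metric is computed:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// `predictions` is a hypothetical DataFrame with "rating" and "prediction"
// columns; na.drop removes rows whose "prediction" is null or NaN
// (i.e. users/items that were missing from the training folds).
val cleaned = predictions.na.drop(Seq("prediction"))

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(cleaned)
{code}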

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to be the same pbm
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14492) Spark 1.6.0 cannot work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2016-04-08 Thread Sunil Rangwani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Rangwani updated SPARK-14492:
---
Description: 
Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
because this field was introduced in Hive 1.2.0, so it's not possible to use a 
Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 

```
Exception in thread "main" java.lang.NoSuchFieldError: 
METASTORE_CLIENT_SOCKET_LIFETIME
at 
org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

  was:
Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
because this field was introduced in Hive 1.2.0 so its not possible to use Hive 
metastore version lower than 1.2.0 with Spark. The details of the Hive changes 
can be found here: https://issues.apache.org/jira/browse/HIVE-9508 


Exception in thread "main" java.lang.NoSuchFieldError: 
METASTORE_CLIENT_SOCKET_LIFETIME
at 
org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   

[jira] [Updated] (SPARK-14492) Spark 1.6.0 cannot work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2016-04-08 Thread Sunil Rangwani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Rangwani updated SPARK-14492:
---
Description: 
Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
because this field was introduced in Hive 1.2.0, so it's not possible to use a 
Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 

{code:java}
Exception in thread "main" java.lang.NoSuchFieldError: 
METASTORE_CLIENT_SOCKET_LIFETIME
at 
org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
because this field was introduced in Hive 1.2.0 so its not possible to use Hive 
metastore version lower than 1.2.0 with Spark. The details of the Hive changes 
can be found here: https://issues.apache.org/jira/browse/HIVE-9508 

```
Exception in thread "main" java.lang.NoSuchFieldError: 
METASTORE_CLIENT_SOCKET_LIFETIME
at 
org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
at 
org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[jira] [Commented] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232335#comment-15232335
 ] 

Sean Owen commented on SPARK-14477:
---

[~mgrover] I just realized why you may be suggesting this -- the ASF mirrors 
are apparently offline for the next *four days*:
http://archive.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

I think that means the builds will fail, if they don't have a cached copy 
locally. Like this one, which built vs branch 1.6 and couldn't find 3.3.3:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55346/console

Maybe it will only affect non-master builds because Jenkins has 3.3.9 locally 
already?

At least this would give us a hook to point to a different mirror in Jenkins 
jobs for the next couple days.

> Allow custom mirrors for downloading artifacts in build/mvn
> ---
>
> Key: SPARK-14477
> URL: https://issues.apache.org/jira/browse/SPARK-14477
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
> from. It makes sense to override these locations with mirrors in many cases, 
> so this change will add support for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232357#comment-15232357
 ] 

Sean Owen commented on SPARK-14489:
---

NaN, to me, means the result was undefined or uncomputable. However for 
recommenders there's nothing too strange about being asked for a recommendation 
for a new user. For some methods there's a clear answer: a new user with no 
data gets 0 recommendations; 0 is the meaningful default for the implicit case. 
Some kind of global mean is better than nothing for the explicit case. It 
doesn't bias the metrics, as an answer is an answer; some are better than 
others but that's what we're measuring.

As I say the problem with ignoring NaN is that you don't consider these cases, 
but they're legitimate cases where the recommender wasn't able to produce a 
result, and that should be reflected as "bad".

Still, as a stop-gap, assuming NaN is rare, ignoring NaN in the evaluator is 
strictly an improvement since it means you can return some meaningful answer 
instead of none. Later, if the ALS implementation never returns NaN, then this 
behavior in the evaluator doesn't matter anyway. So I'd support that change as 
a local improvement.
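
As a hedged sketch of the global-mean fallback described above for the explicit 
case, applied post hoc to the predictions rather than inside ALS itself, and 
assuming hypothetical {{training}} and {{predictions}} DataFrames with a 
"rating" column:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Replace NaN predictions (unseen users/items) with the global mean rating
// of the hypothetical `training` DataFrame before computing the metric.
val globalMean = training.selectExpr("avg(rating)").first().getDouble(0)
val filled = predictions.na.fill(globalMean, Seq("prediction"))

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(filled)
{code}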

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to be the same pbm
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Summary: "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates 
persisted table  (was: Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING 
... AS SELECT ...")

> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY 
> TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird 
> semantics.
> Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> *Weird behavior*
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}, and the query result is written in Parquet format 
> under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on 
> my local machine.
> *Weird semantics*
> Secondly, even if this DDL statement does create a temporary table, the 
> semantics is still somewhat weird:
> # It has a {{AS SELECT ...}} clause, which is supposed to run a given query 
> instead of loading data from existing files.
> # It has a {{USING }} clause, which is supposed to, I guess, 
> converting the result of the above query into the given format. And by 
> "converting", we have to write out the data into file system.
> # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
> in-memory temporary table using the files written above?
> The main questions:
> # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a 
> valid one?
> # If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
> command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
>  which exactly maps to this combination?
> # If it is, what is the expected semantics?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Description: 
The following Spark shell snippet reproduces this bug:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}.

Explain shows that parser probably drops {{TEMPORARY}} while parsing this 
statement:

{noformat}
== Parsed Logical Plan ==
'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- 'Project [*]
   +- 'UnresolvedRelation `x`, None

== Analyzed Logical Plan ==

CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Project [id#0L]
   +- SubqueryAlias x
  +- Range 0, 10, 1, 1, [id#0L]

== Optimized Logical Plan ==
CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Range 0, 10, 1, 1, [id#0L]

== Physical Plan ==
ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
[Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
[id#0L]|
{noformat}


  was:
Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE 
... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics.

Let's try the following Spark shell snippet:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

*Weird behavior*

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}, and the query result is written in Parquet format under default 
Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local 
machine.

*Weird semantics*

Secondly, even if this DDL statement does create a temporary table, the 
semantics is still somewhat weird:

# It has a {{AS SELECT ...}} clause, which is supposed to run a given query 
instead of loading data from existing files.
# It has a {{USING }} clause, which is supposed to, I guess, converting 
the result of the above query into the given format. And by "converting", we 
have to write out the data into file system.
# It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an 
in-memory temporary table using the files written above?

The main questions:

# Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid 
one?
# If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} 
command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116],
 which exactly maps to this combination?
# If it is, what is the expected semantics?



> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that parser probably drops {{TEMPORARY}} while parsing this 
> statement:
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}

[jira] [Updated] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14488:
---
Description: 
The following Spark shell snippet reproduces this bug:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}.

Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
rather than {{CreateTempTableUsingAsSelect}}.

{noformat}
== Parsed Logical Plan ==
'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- 'Project [*]
   +- 'UnresolvedRelation `x`, None

== Analyzed Logical Plan ==

CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Project [id#0L]
   +- SubqueryAlias x
  +- Range 0, 10, 1, 1, [id#0L]

== Optimized Logical Plan ==
CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Range 0, 10, 1, 1, [id#0L]

== Physical Plan ==
ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
[Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
[id#0L]|
{noformat}


  was:
The following Spark shell snippet reproduces this bug:

{code}
sqlContext range 10 registerTempTable "x"

// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"

sqlContext.tables().show()
{code}

It shows the following result:

{noformat}
+-+---+
|tableName|isTemporary|
+-+---+
|y|  false|
|x|   true|
+-+---+
{noformat}

Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY 
TABLE ...}}.

Explain shows that parser probably drops {{TEMPORARY}} while parsing this 
statement:

{noformat}
== Parsed Logical Plan ==
'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- 'Project [*]
   +- 'UnresolvedRelation `x`, None

== Analyzed Logical Plan ==

CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Project [id#0L]
   +- SubqueryAlias x
  +- Range 0, 10, 1, 1, [id#0L]

== Optimized Logical Plan ==
CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
None, Overwrite, Map()
+- Range 0, 10, 1, 1, [id#0L]

== Physical Plan ==
ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
[Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
[id#0L]|
{noformat}



> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
> rather than {{CreateTempTableUsingAsSelect}}.
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---

[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232401#comment-15232401
 ] 

Cheng Lian commented on SPARK-14488:


Ah, sorry, the logical plan class {{CreateTableUsingAsSelect}} uses a boolean 
flag to indicate whether the table is temporary or not, while physical plan 
uses two different classes {{CreateTempTableUsingAsSelect}} and 
{{CreateTableUsingAsSelect}}. Then something is probably wrong in the planner.

> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
> rather than {{CreateTempTableUsingAsSelect}}.
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232402#comment-15232402
 ] 

Herman van Hovell commented on SPARK-14488:
---

{{CreateTempTableUsingAsSelect}} should be planned by {{SparkStrategies 
DDLStrategy}}, see: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L428-L431

> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
> rather than {{CreateTempTableUsingAsSelect}}.
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232414#comment-15232414
 ] 

Cheng Lian commented on SPARK-14488:


Discussed with [~yhuai] offline, and here's the summary:

{{CreateTempTableUsingAsSelect}} existed since 1.3 (I'm surprised that I never 
noticed it!). Its semantics is:

# Execute the {{SELECT}} query.
# Store query result to a user specified position in filesystem. Note that this 
means the {{PATH}} data source option should always be set when using this DDL 
command.
# Create a temporary table using written files.

Basically, it can be used to dump query results to the filesystem without 
creating persisted tables. It's indeed a confusing command and is kinda equivalent to 
the following DDL sequence:

- {{INSERT OVERWRITE DIRECTORY ... STORE AS ... SELECT ...}}
- {{CREATE TEMPORARY TABLE ... USING ... OPTION (PATH ...)}}

However, Spark hasn't implemented {{INSERT OVERWRITE DIRECTORY}} yet. In the 
long run, we should implement it and deprecate this confusing DDL command.

Ticket title and description were updated accordingly.
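
For reference, a hedged sketch of the command with an explicit {{path}} option 
(the location is hypothetical), matching the semantics described above:

{code}
// The query result is written under the given (hypothetical) path and a
// temporary table is then created over the written Parquet files.
sqlContext.sql(
  """CREATE TEMPORARY TABLE y
    |USING PARQUET
    |OPTIONS (path '/tmp/y')
    |AS SELECT * FROM x""".stripMargin)
{code}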

> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
> rather than {{CreateTempTableUsingAsSelect}}.
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232414#comment-15232414
 ] 

Cheng Lian edited comment on SPARK-14488 at 4/8/16 4:17 PM:


Discussed with [~yhuai] offline, and here's the summary:

{{CreateTempTableUsingAsSelect}} has existed since 1.3 (I'm surprised that I 
never noticed it!). Its semantics are:

# Execute the {{SELECT}} query.
# Store the query result at a user-specified location in the filesystem. Note 
that this means the {{PATH}} data source option should always be set when using 
this DDL command.
# Create a temporary table using written files.

Basically, it can be used to dump query results to the filesystem without 
creating persisted tables. It's indeed a confusing command and is kinda 
equivalent to the following DDL sequence:

- {{INSERT OVERWRITE DIRECTORY ... STORED AS ... SELECT ...}}
- {{CREATE TEMPORARY TABLE ... USING ... OPTIONS (PATH ...)}}

However, Spark hasn't implemented {{INSERT OVERWRITE DIRECTORY}} yet. In the 
long run, we should implement it and deprecate this confusing DDL command.

Ticket title and description were updated accordingly.


was (Author: lian cheng):
Discussed with [~yhuai] offline, and here's the summary:

{{CreateTempTableUsingAsSelect}} existed since 1.3 (I'm surprised that I never 
noticed it!). Its semantics is:

# Execute the {{SELECT}} query.
# Store query result to a user specified position in filesystem. Note that this 
means the {{PATH}} data source option should always be set when using this DDL 
command.
# Create a temporary table using written files.

Basically, it can be used to dump query results to the filesystem without 
creating persisted tables. It's indeed a confusing  and is kinda equivalent to 
the following DDL sequence:

- {{INSERT OVERWRITE DIRECTORY ... STORE AS ... SELECT ...}}
- {{CREATE TEMPORARY TABLE ... USING ... OPTION (PATH ...)}}

However, Spark hasn't implemented {{INSERT OVERWRITE DIRECTORY}} yet. In the 
long run, we should implement it and deprecate this confusing DDL command.

Ticket title and description were updated accordingly.

> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
> rather than {{CreateTempTableUsingAsSelect}}.
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table

2016-04-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232422#comment-15232422
 ] 

Cheng Lian commented on SPARK-14488:


Yeah, that's why I came to this DDL command: it seems to be the only way to 
trigger {{CreateTempTableUsingAsSelect}}. However, the physical plan doesn't 
use it. Will look into this. Thanks!

> "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
> 
>
> Key: SPARK-14488
> URL: https://issues.apache.org/jira/browse/SPARK-14488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet reproduces this bug:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +-+---+
> |tableName|isTemporary|
> +-+---+
> |y|  false|
> |x|   true|
> +-+---+
> {noformat}
> Note that {{y}} is NOT temporary although it's created using {{CREATE 
> TEMPORARY TABLE ...}}.
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} 
> rather than {{CreateTempTableUsingAsSelect}}.
> {noformat}
> == Parsed Logical Plan ==
> 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- 'Project [*]
>+- 'UnresolvedRelation `x`, None
> == Analyzed Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Project [id#0L]
>+- SubqueryAlias x
>   +- Range 0, 10, 1, 1, [id#0L]
> == Optimized Logical Plan ==
> CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, 
> None, Overwrite, Map()
> +- Range 0, 10, 1, 1, [id#0L]
> == Physical Plan ==
> ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, 
> [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, 
> [id#0L]|
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14493) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." should always be used with a user defined path

2016-04-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-14493:
--

 Summary: "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." 
should always be used with a user defined path
 Key: SPARK-14493
 URL: https://issues.apache.org/jira/browse/SPARK-14493
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


In the current Spark 2.0 master, the following DDL command doesn't specify a 
user-defined path, and it writes the query result to the default Hive warehouse 
location:

{code}
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
{code}

In Spark 1.6, it results in the following exception, which is expected behavior:

{noformat}
scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM 
x"
java.util.NoSuchElementException: key not found: path
{noformat}
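
For comparison, a sketch of the usage this ticket argues should be required, 
with an explicitly supplied path; the path value and the Parquet format are 
illustrative assumptions:

{code}
sqlContext.sql(
  """CREATE TEMPORARY TABLE y
    |USING PARQUET
    |OPTIONS (path '/tmp/y_data')
    |AS SELECT * FROM x""".stripMargin)
{code}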



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14475) Propagate user-defined context from driver to executors

2016-04-08 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232455#comment-15232455
 ] 

Eric Liang commented on SPARK-14475:


I think the main difference is that this is transparent to the user code, 
i.e. it can be used by framework or application authors. Otherwise, you can of 
course already propagate arbitrary values with closures.
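
To make the distinction concrete, a minimal sketch: closure capture (first 
snippet) works today, while the {{TaskContext}} accessor in the second snippet 
is the proposed, assumed API rather than something that exists yet:

{code}
// Today: propagate a value by explicitly capturing it in the task closure.
val traceId = "abc-123"
sc.parallelize(1 to 4).foreach { _ =>
  println(s"closure-captured trace id: $traceId")
}

// Proposal (assumed API): set a local property on the driver and read it
// transparently inside tasks via TaskContext, without touching user closures.
sc.setLocalProperty("trace.id", "abc-123")
sc.parallelize(1 to 4).foreach { _ =>
  val id = org.apache.spark.TaskContext.get().getLocalProperty("trace.id")
  println(s"propagated trace id: $id")
}
{code}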




> Propagate user-defined context from driver to executors
> ---
>
> Key: SPARK-14475
> URL: https://issues.apache.org/jira/browse/SPARK-14475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>
> It would be useful (e.g. for tracing) to automatically propagate arbitrary 
> user defined context (i.e. thread-locals) from the driver to executors. We 
> can do this easily by adding sc.localProperties to TaskContext.
> cc [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14433) PySpark ml GaussianMixture

2016-04-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232465#comment-15232465
 ] 

Miao Wang commented on SPARK-14433:
---

Thanks! I started to learn the usage and will begin coding later today.

Miao

> PySpark ml GaussianMixture
> --
>
> Key: SPARK-14433
> URL: https://issues.apache.org/jira/browse/SPARK-14433
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Python wrapper for GaussianMixture in spark.ml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?

2016-04-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232468#comment-15232468
 ] 

Yanbo Liang commented on SPARK-14478:
-

Should we add a param that controls whether to use biased or unbiased variance 
in StandardScaler?
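
For reference, a tiny sketch (plain Scala, not Spark code) of the difference 
being discussed; the only change between the two estimators is the denominator:

{code}
val xs = Seq(1.0, 2.0, 3.0, 4.0)
val n = xs.length
val mean = xs.sum / n
val sumSq = xs.map(x => math.pow(x - mean, 2)).sum
val unbiasedVariance = sumSq / (n - 1)  // what StandardScaler currently uses
val biasedVariance   = sumSq / n        // what scikit-learn, MLlib's GLMs, and glmnet use
{code}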

> Should StandardScaler use biased variance to scale?
> ---
>
> Key: SPARK-14478
> URL: https://issues.apache.org/jira/browse/SPARK-14478
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> Currently, MLlib's StandardScaler scales columns using the unbiased standard 
> deviation.  This matches what R's scale package does.
> However, it is a bit odd for 2 reasons:
> * Optimization/ML algorithms which require scaled columns generally assume 
> unit variance (for mathematical convenience).  That requires using biased 
> variance.
> * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance.
> *Question*: Should we switch to biased?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


