[jira] [Commented] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028334#comment-15028334
 ] 

Dilip Biswal commented on SPARK-11997:
--

I would like to work on this issue.

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)

[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2015-11-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028340#comment-15028340
 ] 

Saisai Shao commented on SPARK-12009:
-

Looking at the code again, {{onDisconnected}} in {{ApplicationMaster}} calls 
{{finish()}}, which interrupts the reporter thread; once that thread is 
interrupted, there is no chance for {{YarnAllocator}} to request new containers. 
It's quite strange that this happened.
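
To make that reasoning concrete, here is a simplified, hypothetical sketch (not 
the actual {{ApplicationMaster}} code) of a reporter thread whose allocation loop 
stops once {{finish()}} interrupts it; the class and method names are placeholders.

{code}
// Hypothetical sketch only -- not Spark's ApplicationMaster. Once finish()
// interrupts the reporter thread, the loop that asks the allocator for new
// containers never runs again.
trait AllocatorSketch { def allocateResources(): Unit }

class ReporterSketch(allocator: AllocatorSketch) {
  @volatile private var finished = false

  private val reporterThread = new Thread(new Runnable {
    override def run(): Unit =
      try {
        while (!finished) {
          allocator.allocateResources() // ask YARN for containers / heartbeat
          Thread.sleep(3000)            // arbitrary report interval for this sketch
        }
      } catch {
        case _: InterruptedException => () // interrupted by finish(): stop allocating
      }
  })

  def start(): Unit = reporterThread.start()

  def finish(): Unit = {
    finished = true
    reporterThread.interrupt() // after this, no new container requests can be made
  }
}
{code}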

> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Priority: Minor
>
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-11-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028349#comment-15028349
 ] 

Jeff Zhang commented on SPARK-4924:
---

[~vanzin] Is there any user documentation for it? I didn't find any on the Spark 
official site. If this is not production ready, I think adding documentation to 
let users know would be a good start.
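
In the meantime, a minimal usage sketch of the launcher library 
({{org.apache.spark.launcher.SparkLauncher}}, available since 1.4.0) might look 
like the following; the Spark home, jar path, main class and master URL are 
placeholders.

{code}
import org.apache.spark.launcher.SparkLauncher

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    // All paths, class names and the master URL below are placeholders.
    val process = new SparkLauncher()
      .setSparkHome("/path/to/spark")
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("yarn-cluster")
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
      .launch()        // spawns spark-submit and returns a java.lang.Process
    process.waitFor()  // wait for the application to finish
  }
}
{code}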

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.






[jira] [Resolved] (SPARK-11973) Filter pushdown does not work with aggregation with alias

2015-11-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11973.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Filter pushdown does not work with aggregation with alias
> -
>
> Key: SPARK-11973
> URL: https://issues.apache.org/jira/browse/SPARK-11973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-11987) Python API/Programming guide update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028358#comment-15028358
 ] 

Xusen Yin commented on SPARK-11987:
---

I am working on it

> Python API/Programming guide update for ChiSqSelector and QuantileDiscretizer
> -
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package. 
> Then add Python APIs to the programming guide.
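
For reference, a minimal sketch of the existing Scala API that the requested 
Python wrappers would mirror (assuming the Spark 1.6 {{ml.feature}} package); 
column names and parameter values are placeholders.

{code}
import org.apache.spark.ml.feature.{ChiSqSelector, QuantileDiscretizer}

// Placeholders: "hour", "features", "label", and the bucket/feature counts.
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(3)
// discretizer.fit(df).transform(df) adds the bucketized column

val selector = new ChiSqSelector()
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
// selector.fit(df).transform(df) keeps only the selected features
{code}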






[jira] [Comment Edited] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028334#comment-15028334
 ] 

Dilip Biswal edited comment on SPARK-11997 at 11/26/15 8:32 AM:


I would like to work on this issue. Currently testing the patch.


was (Author: dkbiswal):
I would like to work on this issue.

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
>   at org.apa

[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2015-11-26 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028363#comment-15028363
 ] 

SuYan commented on SPARK-12009:
---

The AM has not exited; it exits after the driver finishes executing its user code 
in the user thread.

The log below indicates that an executor was terminated:
2015-11-26,03:05:16,791 INFO 
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated or 
disconnected! Shutting down. XX.XX.XX.XX:38734


> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Priority: Minor
>
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Comment Edited] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2015-11-26 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028363#comment-15028363
 ] 

SuYan edited comment on SPARK-12009 at 11/26/15 8:42 AM:
-

The AM has not exited; it exits after the driver completes its user code in the 
user thread.

The log below indicates that an executor was terminated:
2015-11-26,03:05:16,791 INFO 
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated or 
disconnected! Shutting down. XX.XX.XX.XX:38734



was (Author: suyan):
AM is not exit, it will exit while driver execute its usercode in userThread.  

the below logs tell that the a executor is terminated.
2015-11-26,03:05:16,791 INFO 
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated or 
disconnected! Shutting down. XX.XX.XX.XX:38734


> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Priority: Minor
>
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Comment Edited] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2015-11-26 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028363#comment-15028363
 ] 

SuYan edited comment on SPARK-12009 at 11/26/15 8:42 AM:
-

The AM has not exited; it exits after the driver completes its user code in the 
user thread.

The log below indicates that an executor was terminated:
2015-11-26,03:05:16,791 INFO 
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated or 
disconnected! Shutting down. XX.XX.XX.XX:38734



was (Author: suyan):
AM is not exit, it will exit while driver complete its usercode in userThread.  

the below logs tell that the a executor is terminated.
2015-11-26,03:05:16,791 INFO 
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated or 
disconnected! Shutting down. XX.XX.XX.XX:38734


> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Priority: Minor
>
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2015-11-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028370#comment-15028370
 ] 

Saisai Shao commented on SPARK-12009:
-

If I understand correctly, this log should come from the application master; I 
don't think it is related to executor termination.
{code}
override def onDisconnected(remoteAddress: RpcAddress): Unit = {
  // In cluster mode, do not rely on the disassociated event to exit
  // This avoids potentially reporting incorrect exit codes if the driver fails
  if (!isClusterMode) {
    logInfo(s"Driver terminated or disconnected! Shutting down. $remoteAddress")
    finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
  }
}
{code}

> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Priority: Minor
>
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Created] (SPARK-12011) Stddev/Variance etc should support columnName as arguments

2015-11-26 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12011:
---

 Summary: Stddev/Variance etc should support columnName as arguments
 Key: SPARK-12011
 URL: https://issues.apache.org/jira/browse/SPARK-12011
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yanbo Liang


Spark SQL aggregate functions 
stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis/collect_list/collect_set
should support a columnName argument, like the other aggregate functions 
(max/min/count/sum).
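
A hypothetical sketch of the kind of overload being requested (not the actual 
patch), delegating the columnName form to the existing Column-based form in 
{{org.apache.spark.sql.functions}}:

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions

// Hypothetical overloads (not the actual patch): the String form simply wraps
// the column name and delegates to the existing Column-based function.
def stddev(columnName: String): Column = functions.stddev(new Column(columnName))
def variance(columnName: String): Column = functions.variance(new Column(columnName))

// Usage: df.agg(stddev("value")) instead of df.agg(functions.stddev(df("value")))
{code}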






[jira] [Commented] (SPARK-12011) Stddev/Variance etc should support columnName as arguments

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028378#comment-15028378
 ] 

Apache Spark commented on SPARK-12011:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9994

> Stddev/Variance etc should support columnName as arguments
> --
>
> Key: SPARK-12011
> URL: https://issues.apache.org/jira/browse/SPARK-12011
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> Spark SQL aggregate function 
> stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis/collect_list/collect_set
>  should support columnName as arguments like other aggregate 
> function(max/min/count/sum). 






[jira] [Assigned] (SPARK-12011) Stddev/Variance etc should support columnName as arguments

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12011:


Assignee: (was: Apache Spark)

> Stddev/Variance etc should support columnName as arguments
> --
>
> Key: SPARK-12011
> URL: https://issues.apache.org/jira/browse/SPARK-12011
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> Spark SQL aggregate function 
> stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis/collect_list/collect_set
>  should support columnName as arguments like other aggregate 
> function(max/min/count/sum). 






[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2015-11-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028375#comment-15028375
 ] 

Saisai Shao commented on SPARK-12009:
-

Looks like you're running in yarn-cluster mode, but this log will only be printed 
in yarn-client mode. Is it possible that some other place printed the same log?

> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Priority: Minor
>
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Assigned] (SPARK-12011) Stddev/Variance etc should support columnName as arguments

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12011:


Assignee: Apache Spark

> Stddev/Variance etc should support columnName as arguments
> --
>
> Key: SPARK-12011
> URL: https://issues.apache.org/jira/browse/SPARK-12011
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Spark SQL aggregate function 
> stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis/collect_list/collect_set
>  should support columnName as arguments like other aggregate 
> function(max/min/count/sum). 






[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-26 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028381#comment-15028381
 ] 

Huaxin Gao commented on SPARK-12010:


I would like to work on this problem. 

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra
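
A minimal sketch of the column-name-qualified INSERT described above (not the 
actual {{JdbcUtils.insertStatement()}} code or the proposed patch); the table and 
column names are placeholders.

{code}
// Builds "INSERT INTO <table> (<col1>, <col2>, ...) VALUES (?, ?, ..., ?)".
def insertStatementWithColumns(table: String, columns: Seq[String]): String = {
  val columnList = columns.mkString(", ")
  val placeholders = columns.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($columnList) VALUES ($placeholders)"
}

// insertStatementWithColumns("people", Seq("name", "age"))
// => "INSERT INTO people (name, age) VALUES (?, ?)"
{code}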






[jira] [Commented] (SPARK-11973) Filter pushdown does not work with aggregation with alias

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028393#comment-15028393
 ] 

Apache Spark commented on SPARK-11973:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9995

> Filter pushdown does not work with aggregation with alias
> -
>
> Key: SPARK-11973
> URL: https://issues.apache.org/jira/browse/SPARK-11973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-12005) VerifyError in HyperLogLogPlusPlus with newer JDKs

2015-11-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12005.
-
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> VerifyError in HyperLogLogPlusPlus with newer JDKs
> --
>
> Key: SPARK-12005
> URL: https://issues.apache.org/jira/browse/SPARK-12005
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 1.6.0
>
>
> Not sure if this affects 1.6.0, but it might. I get this error with JDK 1.7.0_67:
> {noformat}
> [info] Exception encountered when attempting to run a suite with class
> name: org.apache.spark.sql.execution.ui.SQLListenerMemoryLeakSuite ***
> ABORTED *** (4 seconds, 111 milliseconds)
> [info]   java.lang.VerifyError: Bad <init> method call from inside of a branch
> [info] Exception Details:
> [info]   Location:
> [info] 
> org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/expressions/Expression;)V
> @82: invokespecial
> {noformat}
> People on the internet seem to see similar errors with newer JDK8 builds also.






[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-26 Thread Christian Kurz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028404#comment-15028404
 ] 

Christian Kurz commented on SPARK-12010:


Hi Huaxin,
thank you for your kind offer.
Actually, I already have a code suggestion available, for which I hope to create 
a pull request shortly. Maybe you could review it and add your thoughts to the 
pull request?
Thanks,
Christian

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra






[jira] [Commented] (SPARK-8369) Support dependency jar and files on HDFS in standalone cluster mode

2015-11-26 Thread tawan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028437#comment-15028437
 ] 

tawan commented on SPARK-8369:
--

I am working on it now and will post a PR for it.

> Support dependency jar and files on HDFS in standalone cluster mode
> ---
>
> Key: SPARK-8369
> URL: https://issues.apache.org/jira/browse/SPARK-8369
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Dong Lei
>
> Currently, in standalone cluster mode, Spark can take care of the app jar 
> whether it is specified by file:// or hdfs://. But the dependencies 
> specified by --jars and --files do not support an hdfs:// prefix. 
> For example:
> spark-submit 
>  ...
> --jars hdfs://path1/1.jar hdfs://path2/2.jar
> --files hdfs://path3/3.file hdfs://path4/4.file
> hdfs://path5/app.jar
> Only app.jar will be downloaded to the driver and distributed to the executors; 
> the others (1.jar, 2.jar, 3.file, 4.file) will not. 
> I think such a feature would be useful for users. 
> 
> To support it, I think we can treat the jars and files like the app jar in 
> DriverRunner: download them and replace the remote addresses with local 
> addresses, so that DriverWrapper does not need to be aware of it. 
> The problem is that replacing these addresses is harder than replacing the 
> location of the app jar, because only the app jar has a placeholder ("<>"). 
> We may need to do some string matching to achieve it. 
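
A hypothetical sketch of the rewriting step described above (not actual 
DriverRunner code): download each hdfs:// dependency and substitute its local 
path wherever the remote URI appears in the launch command; {{download}} stands 
in for whatever copy mechanism would actually be used.

{code}
// Hypothetical sketch only. `download` copies a remote URI to a local working
// directory and returns the local path.
def localizeDependencies(
    command: Seq[String],
    remoteUris: Seq[String],
    download: String => String): Seq[String] = {
  val localized = remoteUris
    .filter(_.startsWith("hdfs://"))
    .map(uri => uri -> download(uri)) // remote URI -> local path
    .toMap
  // Plain string matching, since only the app jar has a dedicated placeholder.
  command.map(arg => localized.getOrElse(arg, arg))
}
{code}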






[jira] [Resolved] (SPARK-11950) Exception throws when executing “exit;” in spark-sql

2015-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11950.
---
Resolution: Duplicate

> Exception throws when executing “exit;” in spark-sql
> 
>
> Key: SPARK-11950
> URL: https://issues.apache.org/jira/browse/SPARK-11950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> spark-sql> exit;
> Exception in thread "main" java.lang.ClassCastException: 
> org.apache.hadoop.hive.ql.session.SessionState cannot be cast to 
> org.apache.hadoop.hive.cli.CliSessionState
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:739)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:123)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-26 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028451#comment-15028451
 ] 

Stavros Kontopoulos commented on SPARK-11638:
-

I verified the example for Spark 1.5.1 and it seems to work fine. It enables 
bidirectional communication between the Mesos master/notebook running inside a 
container and the outside world, specifically the executors running on the host 
machine. A few things still need to be addressed, such as checking how this 
affects TorrentBroadcast communication (it was tested with HttpBroadcast).
Also, enough tests should be added, and Spark 1.6 should be supported and tested.
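
For illustration only, a configuration along these lines (the {{*.advertisedPort}} 
settings exist only in a build with the attached patches applied; all port numbers 
are placeholders for whatever host ports Mesos/Marathon maps to the container):

{code}
import org.apache.spark.SparkConf

// Illustration only: the advertisedPort settings come from the attached patches,
// not stock Spark. All port numbers are placeholders.
val conf = new SparkConf()
  .set("spark.driver.port", "7001")              // port bound inside the container
  .set("spark.driver.advertisedPort", "31001")   // host port executors should use
  .set("spark.fileserver.port", "7002")
  .set("spark.fileserver.advertisedPort", "31002")
  .set("spark.broadcast.port", "7003")
  .set("spark.broadcast.advertisedPort", "31003")
{code}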

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow to advertise its 
> services on ports different than bind ports. Consider following scenario:
> Spark is running inside a Docker container on Mesos, it's a bridge networking 
> mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> all above ports are by default {{0}} (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow an

[jira] [Comment Edited] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-26 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028451#comment-15028451
 ] 

Stavros Kontopoulos edited comment on SPARK-11638 at 11/26/15 9:52 AM:
---

I verified the example for Spark 1.5.1 and it seems to work fine. It enables 
bidirectional communication between the Mesos master/notebook running inside a 
container and the outside world, specifically the executors running on the host 
machine. A few things still need to be addressed, such as checking how this 
affects TorrentBroadcast communication (it was tested with HttpBroadcast).
Also, enough tests should be added, and Spark 1.6 should be supported and tested.


was (Author: skonto):
I verified the example for spark version spark 1.5.1. It seems to work fine. It 
makes possible the bidirectional communication between mresos master/notebook 
within a container to the outside of it, specifically the executors running on 
the host machine. There are some stuff to be addressed like: checking how it 
affects TorrentBroadcast communication (it is tested with HttpBroadcast).
Also enough tests tests should be added and spark 1.6 should be supported and 
tested .

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow to advertise its 
> services on ports different than bind ports. Consider following scenario:
> Spark is running inside a Docker container on Mesos, it's a bridge networking 
> mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 

[jira] [Comment Edited] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-26 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028451#comment-15028451
 ] 

Stavros Kontopoulos edited comment on SPARK-11638 at 11/26/15 9:54 AM:
---

I verified the example for Spark 1.5.1 and it seems to work fine. It enables 
bidirectional communication between the Mesos master/notebook running inside a 
container and the outside world, specifically the executors running on the host 
machine. A few things still need to be addressed, such as checking how this 
affects TorrentBroadcast communication (it was tested with HttpBroadcast).
Also, enough tests should be added, and Spark 1.6 should be supported and tested 
(is this finished?).


was (Author: skonto):
I verified the example for spark version spark 1.5.1. It seems to work fine. It 
makes possible the bidirectional communication between mesos master/notebook 
within a container to the outside of it, specifically the executors running on 
the host machine. There are some stuff to be addressed like: checking how it 
affects TorrentBroadcast communication (it is tested with HttpBroadcast).
Also enough tests should be added and spark 1.6 should be supported and tested .

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow to advertise its 
> services on ports different than bind ports. Consider following scenario:
> Spark is running inside a Docker container on Mesos, it's a bridge networking 
> mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the 

[jira] [Assigned] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11997:


Assignee: Apache Spark

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
>   at 
> org.apache

[jira] [Commented] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028539#comment-15028539
 ] 

Apache Spark commented on SPARK-11997:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/10001

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at 

[jira] [Assigned] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11997:


Assignee: (was: Apache Spark)

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
>   at 
> org.apache.spark.sql.DataFrameWrite

[jira] [Commented] (SPARK-11405) ROW_NUMBER function does not adhere to window ORDER BY, when joining

2015-11-26 Thread Jarno Seppanen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028626#comment-15028626
 ] 

Jarno Seppanen commented on SPARK-11405:


Hi, I had a chance to try this on Spark 1.5.2 and it's fixed there. I'm 
closing the bug.

> ROW_NUMBER function does not adhere to window ORDER BY, when joining
> 
>
> Key: SPARK-11405
> URL: https://issues.apache.org/jira/browse/SPARK-11405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: YARN
>Reporter: Jarno Seppanen
>Priority: Critical
>
> The following query produces incorrect results:
> {code:sql}
> sqlContext.sql("""
>   SELECT a.i, a.x,
> ROW_NUMBER() OVER (
>   PARTITION BY a.i ORDER BY a.x) AS row_num
>   FROM a
>   JOIN b ON b.i = a.i
> """).show()
> +---++---+
> |  i|   x|row_num|
> +---++---+
> |  1|  0.8717439935587555|  1|
> |  1|  0.6684483939068196|  2|
> |  1|  0.3378351523586306|  3|
> |  1|  0.2483285619632939|  4|
> |  1|  0.4796752841655936|  5|
> |  2|  0.2971739640384895|  1|
> |  2|  0.2199359901600595|  2|
> |  2|  0.4646004597998037|  3|
> |  2| 0.24823688829578183|  4|
> |  2|  0.5914212915574378|  5|
> |  3|0.010912835935112164|  1|
> |  3|  0.6520139509583123|  2|
> |  3|  0.8571994559240592|  3|
> |  3|  0.1122635843020473|  4|
> |  3| 0.45913022936460457|  5|
> +---++---+
> {code}
> The row number doesn't follow the correct order. The join seems to break the 
> order; ROW_NUMBER() works correctly if the join results are saved to a 
> temporary table and a second query is made.
> Here's a small PySpark test case to reproduce the error:
> {code}
> from pyspark.sql import Row
> import random
> a = sc.parallelize([Row(i=i, x=random.random())
> for i in range(5)
> for j in range(5)])
> b = sc.parallelize([Row(i=i) for i in [1, 2, 3]])
> af = sqlContext.createDataFrame(a)
> bf = sqlContext.createDataFrame(b)
> af.registerTempTable('a')
> bf.registerTempTable('b')
> af.show()
> # +---++
> # |  i|   x|
> # +---++
> # |  0| 0.12978974167478896|
> # |  0|  0.7105927498584452|
> # |  0| 0.21225679077448045|
> # |  0| 0.03849717391728036|
> # |  0|  0.4976622146442401|
> # |  1|  0.4796752841655936|
> # |  1|  0.8717439935587555|
> # |  1|  0.6684483939068196|
> # |  1|  0.3378351523586306|
> # |  1|  0.2483285619632939|
> # |  2|  0.2971739640384895|
> # |  2|  0.2199359901600595|
> # |  2|  0.5914212915574378|
> # |  2| 0.24823688829578183|
> # |  2|  0.4646004597998037|
> # |  3|  0.1122635843020473|
> # |  3|  0.6520139509583123|
> # |  3| 0.45913022936460457|
> # |  3|0.010912835935112164|
> # |  3|  0.8571994559240592|
> # +---++
> # only showing top 20 rows
> bf.show()
> # +---+
> # |  i|
> # +---+
> # |  1|
> # |  2|
> # |  3|
> # +---+
> ### WRONG
> sqlContext.sql("""
>   SELECT a.i, a.x,
> ROW_NUMBER() OVER (
>   PARTITION BY a.i ORDER BY a.x) AS row_num
>   FROM a
>   JOIN b ON b.i = a.i
> """).show()
> # +---++---+
> # |  i|   x|row_num|
> # +---++---+
> # |  1|  0.8717439935587555|  1|
> # |  1|  0.6684483939068196|  2|
> # |  1|  0.3378351523586306|  3|
> # |  1|  0.2483285619632939|  4|
> # |  1|  0.4796752841655936|  5|
> # |  2|  0.2971739640384895|  1|
> # |  2|  0.2199359901600595|  2|
> # |  2|  0.4646004597998037|  3|
> # |  2| 0.24823688829578183|  4|
> # |  2|  0.5914212915574378|  5|
> # |  3|0.010912835935112164|  1|
> # |  3|  0.6520139509583123|  2|
> # |  3|  0.8571994559240592|  3|
> # |  3|  0.1122635843020473|  4|
> # |  3| 0.45913022936460457|  5|
> # +---++---+
> ### WORKAROUND BY USING TEMP TABLE
> t = sqlContext.sql("""
>   SELECT a.i, a.x
>   FROM a
>   JOIN b ON b.i = a.i
> """).cache()
> # trigger computation
> t.head()
> t.registerTempTable('t')
> sqlContext.sql("""
>   SELECT i, x,
> ROW_NUMBER() OVER (
>   PARTITION BY i ORDER BY x) AS row_num
>   FROM t
> """).show()
> # +---++---+
> # |  i|   x|row_num|
> # +---++---+
> # |  1|  0.2483285619632939|  1|
> # |  1|  0.3378351523586306|  2|
> # |  1|  0.4796752841655936|  3|
> # |  1|  0.6684483939068196|  4|
> # |  1|  0.8717439935587555|  5|
> # |  2|  0.2199359901600595|  1|
> # |  2| 0.24823688829578183|  2|
> # |  2|  0.2971739640384895|  3|
> # |  2|  0.4646004597998037|  4|
> # |  2|  0.59142129155743

[jira] [Closed] (SPARK-11405) ROW_NUMBER function does not adhere to window ORDER BY, when joining

2015-11-26 Thread Jarno Seppanen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarno Seppanen closed SPARK-11405.
--
   Resolution: Fixed
Fix Version/s: 1.5.2

Works in Spark 1.5.2

> ROW_NUMBER function does not adhere to window ORDER BY, when joining
> 
>
> Key: SPARK-11405
> URL: https://issues.apache.org/jira/browse/SPARK-11405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: YARN
>Reporter: Jarno Seppanen
>Priority: Critical
> Fix For: 1.5.2
>
>
> The following query produces incorrect results:
> {code:sql}
> sqlContext.sql("""
>   SELECT a.i, a.x,
> ROW_NUMBER() OVER (
>   PARTITION BY a.i ORDER BY a.x) AS row_num
>   FROM a
>   JOIN b ON b.i = a.i
> """).show()
> +---++---+
> |  i|   x|row_num|
> +---++---+
> |  1|  0.8717439935587555|  1|
> |  1|  0.6684483939068196|  2|
> |  1|  0.3378351523586306|  3|
> |  1|  0.2483285619632939|  4|
> |  1|  0.4796752841655936|  5|
> |  2|  0.2971739640384895|  1|
> |  2|  0.2199359901600595|  2|
> |  2|  0.4646004597998037|  3|
> |  2| 0.24823688829578183|  4|
> |  2|  0.5914212915574378|  5|
> |  3|0.010912835935112164|  1|
> |  3|  0.6520139509583123|  2|
> |  3|  0.8571994559240592|  3|
> |  3|  0.1122635843020473|  4|
> |  3| 0.45913022936460457|  5|
> +---++---+
> {code}
> The row number doesn't follow the correct order. The join seems to break the 
> order; ROW_NUMBER() works correctly if the join results are saved to a 
> temporary table and a second query is made.
> Here's a small PySpark test case to reproduce the error:
> {code}
> from pyspark.sql import Row
> import random
> a = sc.parallelize([Row(i=i, x=random.random())
> for i in range(5)
> for j in range(5)])
> b = sc.parallelize([Row(i=i) for i in [1, 2, 3]])
> af = sqlContext.createDataFrame(a)
> bf = sqlContext.createDataFrame(b)
> af.registerTempTable('a')
> bf.registerTempTable('b')
> af.show()
> # +---++
> # |  i|   x|
> # +---++
> # |  0| 0.12978974167478896|
> # |  0|  0.7105927498584452|
> # |  0| 0.21225679077448045|
> # |  0| 0.03849717391728036|
> # |  0|  0.4976622146442401|
> # |  1|  0.4796752841655936|
> # |  1|  0.8717439935587555|
> # |  1|  0.6684483939068196|
> # |  1|  0.3378351523586306|
> # |  1|  0.2483285619632939|
> # |  2|  0.2971739640384895|
> # |  2|  0.2199359901600595|
> # |  2|  0.5914212915574378|
> # |  2| 0.24823688829578183|
> # |  2|  0.4646004597998037|
> # |  3|  0.1122635843020473|
> # |  3|  0.6520139509583123|
> # |  3| 0.45913022936460457|
> # |  3|0.010912835935112164|
> # |  3|  0.8571994559240592|
> # +---++
> # only showing top 20 rows
> bf.show()
> # +---+
> # |  i|
> # +---+
> # |  1|
> # |  2|
> # |  3|
> # +---+
> ### WRONG
> sqlContext.sql("""
>   SELECT a.i, a.x,
> ROW_NUMBER() OVER (
>   PARTITION BY a.i ORDER BY a.x) AS row_num
>   FROM a
>   JOIN b ON b.i = a.i
> """).show()
> # +---++---+
> # |  i|   x|row_num|
> # +---++---+
> # |  1|  0.8717439935587555|  1|
> # |  1|  0.6684483939068196|  2|
> # |  1|  0.3378351523586306|  3|
> # |  1|  0.2483285619632939|  4|
> # |  1|  0.4796752841655936|  5|
> # |  2|  0.2971739640384895|  1|
> # |  2|  0.2199359901600595|  2|
> # |  2|  0.4646004597998037|  3|
> # |  2| 0.24823688829578183|  4|
> # |  2|  0.5914212915574378|  5|
> # |  3|0.010912835935112164|  1|
> # |  3|  0.6520139509583123|  2|
> # |  3|  0.8571994559240592|  3|
> # |  3|  0.1122635843020473|  4|
> # |  3| 0.45913022936460457|  5|
> # +---++---+
> ### WORKAROUND BY USING TEMP TABLE
> t = sqlContext.sql("""
>   SELECT a.i, a.x
>   FROM a
>   JOIN b ON b.i = a.i
> """).cache()
> # trigger computation
> t.head()
> t.registerTempTable('t')
> sqlContext.sql("""
>   SELECT i, x,
> ROW_NUMBER() OVER (
>   PARTITION BY i ORDER BY x) AS row_num
>   FROM t
> """).show()
> # +---++---+
> # |  i|   x|row_num|
> # +---++---+
> # |  1|  0.2483285619632939|  1|
> # |  1|  0.3378351523586306|  2|
> # |  1|  0.4796752841655936|  3|
> # |  1|  0.6684483939068196|  4|
> # |  1|  0.8717439935587555|  5|
> # |  2|  0.2199359901600595|  1|
> # |  2| 0.24823688829578183|  2|
> # |  2|  0.2971739640384895|  3|
> # |  2|  0.4646004597998037|  4|
> # |  2|  0.5914212915574378|  5|
> # |  3|0.010912835935112164|  1|

[jira] [Created] (SPARK-12012) Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan

2015-11-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12012:
--

 Summary: Show more comprehensive PhysicalRDD metadata when 
visualizing SQL query plan
 Key: SPARK-12012
 URL: https://issues.apache.org/jira/browse/SPARK-12012
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.7.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11551) Replace example code in ml-features.md using include_example

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028647#comment-15028647
 ] 

Apache Spark commented on SPARK-11551:
--

User 'somideshmukh' has created a pull request for this issue:
https://github.com/apache/spark/pull/10002

> Replace example code in ml-features.md using include_example
> 
>
> Key: SPARK-11551
> URL: https://issues.apache.org/jira/browse/SPARK-11551
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12013) Add a Hive context method to retrieve the database Names

2015-11-26 Thread Jose Antonio (JIRA)
Jose Antonio created SPARK-12013:


 Summary: Add a Hive context method to retrieve the database Names
 Key: SPARK-12013
 URL: https://issues.apache.org/jira/browse/SPARK-12013
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.2
Reporter: Jose Antonio
 Fix For: 1.6.0


Like the HiveContext method tableNames(), add a new method to retrieve the 
database names.

Currently one has to run a SHOW DATABASES query and call dataframe.collect() 
on the result.

This is very slow compared to asking the Hive metastore directly, which can 
answer the question quickly.
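
For illustration, a minimal PySpark sketch of the workaround described above. The {{databaseNames()}} call at the end is hypothetical - it is the kind of method this issue proposes and does not exist in this Spark version.

{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="list-databases-sketch")
sqlContext = HiveContext(sc)

# Current workaround: run a SQL statement and collect the resulting DataFrame.
db_names = [row[0] for row in sqlContext.sql("SHOW DATABASES").collect()]
print(db_names)

# Hypothetical API proposed by this issue, mirroring sqlContext.tableNames():
# db_names = sqlContext.databaseNames()
{code}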



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.

2015-11-26 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028657#comment-15028657
 ] 

Meethu Mathew commented on SPARK-2572:
--

[~srowen] We are facing this issue with Mesos fine-grained mode in Spark 
1.4.1. The /tmp/spark-* directories and some blockmgr-* files still exist even 
after calling sc.stop(). Is there any other way to solve this issue?

> Can't delete local dir on executor automatically when running spark over 
> Mesos.
> ---
>
> Key: SPARK-2572
> URL: https://issues.apache.org/jira/browse/SPARK-2572
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Yadong Qi
>Priority: Minor
>
> When running Spark over Mesos in “fine-grained” or “coarse-grained” mode, the 
> local dir (/tmp/spark-local-20140718114058-834c) on the executor is not 
> deleted automatically after the application finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028672#comment-15028672
 ] 

Apache Spark commented on SPARK-12010:
--

User 'CK50' has created a pull request for this issue:
https://github.com/apache/spark/pull/10003

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra
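
As an illustration of the two INSERT shapes described above, here is a small Python sketch that builds the column-name-qualified form from a list of field names. This is not Spark's actual implementation (that lives in JdbcUtils.scala); it only demonstrates the difference in the generated SQL.

{code}
def insert_statement(table, field_names, include_column_list=False):
    """Build a parameterized INSERT statement for `table`.

    include_column_list=False matches the syntax Spark currently emits:
        INSERT INTO $table VALUES ( ?, ?, ..., ? )
    include_column_list=True matches the syntax some drivers require:
        INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
    """
    placeholders = ", ".join(["?"] * len(field_names))
    if include_column_list:
        return "INSERT INTO %s ( %s ) VALUES ( %s )" % (
            table, ", ".join(field_names), placeholders)
    return "INSERT INTO %s VALUES ( %s )" % (table, placeholders)

print(insert_statement("people", ["name", "age"]))
# INSERT INTO people VALUES ( ?, ? )
print(insert_statement("people", ["name", "age"], include_column_list=True))
# INSERT INTO people ( name, age ) VALUES ( ?, ? )
{code}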



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12010:


Assignee: (was: Apache Spark)

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12010:


Assignee: Apache Spark

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12012) Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12012:


Assignee: Apache Spark  (was: Cheng Lian)

> Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan
> 
>
> Key: SPARK-12012
> URL: https://issues.apache.org/jira/browse/SPARK-12012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12012) Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12012:


Assignee: Cheng Lian  (was: Apache Spark)

> Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan
> 
>
> Key: SPARK-12012
> URL: https://issues.apache.org/jira/browse/SPARK-12012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12012) Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028679#comment-15028679
 ] 

Apache Spark commented on SPARK-12012:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10004

> Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan
> 
>
> Key: SPARK-12012
> URL: https://issues.apache.org/jira/browse/SPARK-12012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12012) Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan

2015-11-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12012:
---
Description: Currently a {{PhysicalRDD}} operator is just visualized as a 
node with nothing but a label named "PhysicalRDD"; all the detail information 
is shown only in the tooltip, which can be inconvenient.

> Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan
> 
>
> Key: SPARK-12012
> URL: https://issues.apache.org/jira/browse/SPARK-12012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently a {{PhysicalRDD}} operator is just visualized as a node with 
> nothing but a label named "PhysicalRDD"; all the detail information is shown 
> only in the tooltip, which can be inconvenient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12014) Spark SQL query containing semicolon is broken in Beeline (related to HIVE-11100)

2015-11-26 Thread Teng Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Qiu updated SPARK-12014:
-
Description: 
Actually it is a known Hive issue: 
https://issues.apache.org/jira/browse/HIVE-11100

patch available: https://reviews.apache.org/r/35907/diff/1

But Spark uses its own Maven dependencies for Hive (org.spark-project.hive), so 
we cannot use this patch to fix the problem; it would be better if you could 
fix this in Spark's Hive package.

In spark's beeline, the error message will be:

{code}
0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 87 (state=,code=0)

0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 88 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,';','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 30 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,'\;','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 31 (state=,code=0)
{code}

  was:
Actually it is known hive issue: 
https://issues.apache.org/jira/browse/HIVE-11100

patch available: https://reviews.apache.org/r/35907/diff/1

but Spark uses its own hive maven dependencies for hive 
(org.spark-project.hive), we can not use this patch to fix the problem, it 
would be better if you can fix this in spark's hive package.

In spark's beeline, the error message will be:

{code}
0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 87 (state=,code=0)

0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 88 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,';','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 30 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,'\;','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 31 (state=,code=0)
{code}


> Spark SQL query containing semicolon is broken in Beeline (related to 
> HIVE-11100)
> -
>
> Key: SPARK-12014
> URL: https://issues.apache.org/jira/browse/SPARK-12014
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Teng Qiu
>Priority: Minor
>
> Actually it is a known Hive issue: 
> https://issues.apache.org/jira/browse/HIVE-11100
> patch available: https://reviews.apache.org/r/35907/diff/1
> But Spark uses its own Maven dependencies for Hive (org.spark-project.hive), 
> so we cannot use this patch to fix the problem; it would be better if you 
> could fix this in Spark's Hive package.
> In spark's beeline, the error message will be:
> {code}
> 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
> Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
> expecting StringLiteral near 'BY' in table row format's field separator; line 
> 1 pos 87 (state=,code=0)
> 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
> Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
> expecting StringLiteral near 'BY' in table row format's field separator; line 
> 1 pos 88 (state=,code=0)
> 0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,';','=')['u2'] FROM 
> some_logs WHERE log_date = '20151125' limit 5;
> Error

[jira] [Created] (SPARK-12014) Spark SQL query containing semicolon is broken in Beeline (related to HIVE-11100)

2015-11-26 Thread Teng Qiu (JIRA)
Teng Qiu created SPARK-12014:


 Summary: Spark SQL query containing semicolon is broken in Beeline 
(related to HIVE-11100)
 Key: SPARK-12014
 URL: https://issues.apache.org/jira/browse/SPARK-12014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Teng Qiu
Priority: Minor


Actually it is a known Hive issue: 
https://issues.apache.org/jira/browse/HIVE-11100

patch available: https://reviews.apache.org/r/35907/diff/1

But Spark uses its own Maven dependencies for Hive (org.spark-project.hive), 
so we cannot use this patch to fix the problem; it would be better if you 
could fix this in Spark's Hive package.

In spark's beeline, the error message will be:

{code}
0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 87 (state=,code=0)

0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 88 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,';','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 30 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,'\;','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 31 (state=,code=0)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12014) Spark SQL query containing semicolon is broken in Beeline (related to HIVE-11100)

2015-11-26 Thread Teng Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Qiu updated SPARK-12014:
-
Description: 
Actually it is a known Hive issue: 
https://issues.apache.org/jira/browse/HIVE-11100

patch available: https://reviews.apache.org/r/35907/diff/1

But Spark uses its own Maven dependencies for Hive (org.spark-project.hive), so 
we cannot use this patch to fix the problem; it would be better if you could 
fix this in Spark's Hive package.

In spark's beeline, the error message will be:

{code}
0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 87 (state=,code=0)

0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 88 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,';','=')['key_name'] 
FROM some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 30 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,'\;','=')['key_name'] 
FROM some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 31 (state=,code=0)
{code}

  was:
Actually it is known hive issue: 
https://issues.apache.org/jira/browse/HIVE-11100

patch available: https://reviews.apache.org/r/35907/diff/1

but Spark uses its own maven dependencies for hive (org.spark-project.hive), we 
can not use this patch to fix the problem, it would be better if you can fix 
this in spark's hive package.

In spark's beeline, the error message will be:

{code}
0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 87 (state=,code=0)

0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
expecting StringLiteral near 'BY' in table row format's field separator; line 1 
pos 88 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,';','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 30 (state=,code=0)

0: jdbc:hive2://host:1/> SELECT str_to_map(other_data,'\;','=')['u2'] FROM 
some_logs WHERE log_date = '20151125' limit 5;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
'' '' '' in select expression; line 1 pos 31 (state=,code=0)
{code}


> Spark SQL query containing semicolon is broken in Beeline (related to 
> HIVE-11100)
> -
>
> Key: SPARK-12014
> URL: https://issues.apache.org/jira/browse/SPARK-12014
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Teng Qiu
>Priority: Minor
>
> Actually it is a known Hive issue: 
> https://issues.apache.org/jira/browse/HIVE-11100
> patch available: https://reviews.apache.org/r/35907/diff/1
> But Spark uses its own Maven dependencies for Hive (org.spark-project.hive), 
> so we cannot use this patch to fix the problem; it would be better if you 
> could fix this in Spark's Hive package.
> In spark's beeline, the error message will be:
> {code}
> 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
> Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
> expecting StringLiteral near 'BY' in table row format's field separator; line 
> 1 pos 87 (state=,code=0)
> 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
> Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
> expecting StringLiteral near 'BY' in table row format's field separator; line 
> 1 pos 88 (state=,code=0)
> 0: jdbc:hive2://host:1/> SELECT 
> str_to_map(other_data,';','=')['key_name'] FROM some_logs WHERE log_date = 
> '20151125' 

[jira] [Assigned] (SPARK-11960) User guide section for streaming a/b testing

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11960:


Assignee: Apache Spark  (was: Feynman Liang)

> User guide section for streaming a/b testing
> 
>
> Key: SPARK-11960
> URL: https://issues.apache.org/jira/browse/SPARK-11960
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> [~fliang] Assigning since you added the feature.  Will you have a chance to 
> do this soon?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11960) User guide section for streaming a/b testing

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028801#comment-15028801
 ] 

Apache Spark commented on SPARK-11960:
--

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10005

> User guide section for streaming a/b testing
> 
>
> Key: SPARK-11960
> URL: https://issues.apache.org/jira/browse/SPARK-11960
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Feynman Liang
>
> [~fliang] Assigning since you added the feature.  Will you have a chance to 
> do this soon?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11960) User guide section for streaming a/b testing

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11960:


Assignee: Feynman Liang  (was: Apache Spark)

> User guide section for streaming a/b testing
> 
>
> Key: SPARK-11960
> URL: https://issues.apache.org/jira/browse/SPARK-11960
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Feynman Liang
>
> [~fliang] Assigning since you added the feature.  Will you have a chance to 
> do this soon?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11405) ROW_NUMBER function does not adhere to window ORDER BY, when joining

2015-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11405:
--
Assignee: Josh Rosen

> ROW_NUMBER function does not adhere to window ORDER BY, when joining
> 
>
> Key: SPARK-11405
> URL: https://issues.apache.org/jira/browse/SPARK-11405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: YARN
>Reporter: Jarno Seppanen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.2
>
>
> The following query produces incorrect results:
> {code:sql}
> sqlContext.sql("""
>   SELECT a.i, a.x,
> ROW_NUMBER() OVER (
>   PARTITION BY a.i ORDER BY a.x) AS row_num
>   FROM a
>   JOIN b ON b.i = a.i
> """).show()
> +---++---+
> |  i|   x|row_num|
> +---++---+
> |  1|  0.8717439935587555|  1|
> |  1|  0.6684483939068196|  2|
> |  1|  0.3378351523586306|  3|
> |  1|  0.2483285619632939|  4|
> |  1|  0.4796752841655936|  5|
> |  2|  0.2971739640384895|  1|
> |  2|  0.2199359901600595|  2|
> |  2|  0.4646004597998037|  3|
> |  2| 0.24823688829578183|  4|
> |  2|  0.5914212915574378|  5|
> |  3|0.010912835935112164|  1|
> |  3|  0.6520139509583123|  2|
> |  3|  0.8571994559240592|  3|
> |  3|  0.1122635843020473|  4|
> |  3| 0.45913022936460457|  5|
> +---++---+
> {code}
> The row number doesn't follow the correct order. The join seems to break the 
> order; ROW_NUMBER() works correctly if the join results are saved to a 
> temporary table and a second query is made.
> Here's a small PySpark test case to reproduce the error:
> {code}
> from pyspark.sql import Row
> import random
> a = sc.parallelize([Row(i=i, x=random.random())
> for i in range(5)
> for j in range(5)])
> b = sc.parallelize([Row(i=i) for i in [1, 2, 3]])
> af = sqlContext.createDataFrame(a)
> bf = sqlContext.createDataFrame(b)
> af.registerTempTable('a')
> bf.registerTempTable('b')
> af.show()
> # +---++
> # |  i|   x|
> # +---++
> # |  0| 0.12978974167478896|
> # |  0|  0.7105927498584452|
> # |  0| 0.21225679077448045|
> # |  0| 0.03849717391728036|
> # |  0|  0.4976622146442401|
> # |  1|  0.4796752841655936|
> # |  1|  0.8717439935587555|
> # |  1|  0.6684483939068196|
> # |  1|  0.3378351523586306|
> # |  1|  0.2483285619632939|
> # |  2|  0.2971739640384895|
> # |  2|  0.2199359901600595|
> # |  2|  0.5914212915574378|
> # |  2| 0.24823688829578183|
> # |  2|  0.4646004597998037|
> # |  3|  0.1122635843020473|
> # |  3|  0.6520139509583123|
> # |  3| 0.45913022936460457|
> # |  3|0.010912835935112164|
> # |  3|  0.8571994559240592|
> # +---++
> # only showing top 20 rows
> bf.show()
> # +---+
> # |  i|
> # +---+
> # |  1|
> # |  2|
> # |  3|
> # +---+
> ### WRONG
> sqlContext.sql("""
>   SELECT a.i, a.x,
> ROW_NUMBER() OVER (
>   PARTITION BY a.i ORDER BY a.x) AS row_num
>   FROM a
>   JOIN b ON b.i = a.i
> """).show()
> # +---++---+
> # |  i|   x|row_num|
> # +---++---+
> # |  1|  0.8717439935587555|  1|
> # |  1|  0.6684483939068196|  2|
> # |  1|  0.3378351523586306|  3|
> # |  1|  0.2483285619632939|  4|
> # |  1|  0.4796752841655936|  5|
> # |  2|  0.2971739640384895|  1|
> # |  2|  0.2199359901600595|  2|
> # |  2|  0.4646004597998037|  3|
> # |  2| 0.24823688829578183|  4|
> # |  2|  0.5914212915574378|  5|
> # |  3|0.010912835935112164|  1|
> # |  3|  0.6520139509583123|  2|
> # |  3|  0.8571994559240592|  3|
> # |  3|  0.1122635843020473|  4|
> # |  3| 0.45913022936460457|  5|
> # +---++---+
> ### WORKAROUND BY USING TEMP TABLE
> t = sqlContext.sql("""
>   SELECT a.i, a.x
>   FROM a
>   JOIN b ON b.i = a.i
> """).cache()
> # trigger computation
> t.head()
> t.registerTempTable('t')
> sqlContext.sql("""
>   SELECT i, x,
> ROW_NUMBER() OVER (
>   PARTITION BY i ORDER BY x) AS row_num
>   FROM t
> """).show()
> # +---++---+
> # |  i|   x|row_num|
> # +---++---+
> # |  1|  0.2483285619632939|  1|
> # |  1|  0.3378351523586306|  2|
> # |  1|  0.4796752841655936|  3|
> # |  1|  0.6684483939068196|  4|
> # |  1|  0.8717439935587555|  5|
> # |  2|  0.2199359901600595|  1|
> # |  2| 0.24823688829578183|  2|
> # |  2|  0.2971739640384895|  3|
> # |  2|  0.4646004597998037|  4|
> # |  2|  0.5914212915574378|  5|
> # |  3|0.010912835935112164|  1|
> # |  3|  0.1122635

[jira] [Updated] (SPARK-12013) Add a Hive context method to retrieve the database Names

2015-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12013:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 1.6.0)

[~jamartinh] don't set target/fix version. Is this really that slow? It doesn't 
seem like a common operation, so I don't know if it is worth adding another 
operation to do the same thing.

> Add a Hive context method to retrieve the database Names
> 
>
> Key: SPARK-12013
> URL: https://issues.apache.org/jira/browse/SPARK-12013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jose Antonio
>Priority: Minor
>  Labels: features, sql
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Like the HiveContext method tableNames(), add a new method to retrieve the 
> database names.
> Currently one has to run a SHOW DATABASES query and call dataframe.collect() 
> on the result.
> This is very slow compared to asking the Hive metastore directly, which can 
> answer the question quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12015) Auto convert int to Double when required in pyspark.ml

2015-11-26 Thread Jose Antonio (JIRA)
Jose Antonio created SPARK-12015:


 Summary: Auto convert int to Double when required in pyspark.ml
 Key: SPARK-12015
 URL: https://issues.apache.org/jira/browse/SPARK-12015
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 1.5.2
Reporter: Jose Antonio
 Fix For: 1.6.0


I have received the following exception:

Why can't we pass an integer when the required parameter is a Double?

I think this should be upcast silently.

java.lang.IllegalArgumentException: requirement failed: Column label must be of 
type DoubleType but was actually IntegerType.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at 
org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at 
org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)
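
For context, a minimal PySpark sketch of the manual cast that is currently required before fitting (the DataFrame and column names are hypothetical; this is not a fix, only the workaround implied by the error above):

{code}
from pyspark.sql import Row
from pyspark.sql.functions import col

# Hypothetical training data whose label column is an integer type.
df = sqlContext.createDataFrame([Row(label=0, feature=1.0), Row(label=1, feature=2.0)])

# ML estimators require the label to be DoubleType, so today the cast is manual:
df_double = df.withColumn("label", col("label").cast("double"))
df_double.printSchema()  # the label column is now double
{code}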



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11958) Create user guide section for SQLTransformer

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028856#comment-15028856
 ] 

Apache Spark commented on SPARK-11958:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10006

> Create user guide section for SQLTransformer
> 
>
> Key: SPARK-11958
> URL: https://issues.apache.org/jira/browse/SPARK-11958
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> In ml-features.md
> Code examples could be put under the code examples folder as well (with 
> excerpts used in the guide).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11958) Create user guide section for SQLTransformer

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11958:


Assignee: (was: Apache Spark)

> Create user guide section for SQLTransformer
> 
>
> Key: SPARK-11958
> URL: https://issues.apache.org/jira/browse/SPARK-11958
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> In ml-features.md
> Code examples could be put under the code examples folder as well (with 
> excerpts used in the guide).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11958) Create user guide section for SQLTransformer

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11958:


Assignee: Apache Spark

> Create user guide section for SQLTransformer
> 
>
> Key: SPARK-11958
> URL: https://issues.apache.org/jira/browse/SPARK-11958
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> In ml-features.md
> Code examples could be put under the code examples folder as well (with 
> excerpts used in the guide).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11957) SQLTransformer docs are unclear about generality of SQL statements

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11957:


Assignee: Apache Spark

> SQLTransformer docs are unclear about generality of SQL statements
> --
>
> Key: SPARK-11957
> URL: https://issues.apache.org/jira/browse/SPARK-11957
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> See discussion here for context [SPARK-11234].  The Scala doc needs to be 
> clearer about what SQL statements are supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11957) SQLTransformer docs are unclear about generality of SQL statements

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11957:


Assignee: (was: Apache Spark)

> SQLTransformer docs are unclear about generality of SQL statements
> --
>
> Key: SPARK-11957
> URL: https://issues.apache.org/jira/browse/SPARK-11957
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See discussion here for context [SPARK-11234].  The Scala doc needs to be 
> clearer about what SQL statements are supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11957) SQLTransformer docs are unclear about generality of SQL statements

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028857#comment-15028857
 ] 

Apache Spark commented on SPARK-11957:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10006

> SQLTransformer docs are unclear about generality of SQL statements
> --
>
> Key: SPARK-11957
> URL: https://issues.apache.org/jira/browse/SPARK-11957
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See discussion here for context [SPARK-11234].  The Scala doc needs to be 
> clearer about what SQL statements are supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12016) word2vec load model can't use findSynonyms to get words

2015-11-26 Thread yuangang.liu (JIRA)
yuangang.liu created SPARK-12016:


 Summary: word2vec load model can't use findSynonyms to get words 
 Key: SPARK-12016
 URL: https://issues.apache.org/jira/browse/SPARK-12016
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.2
 Environment: ubuntu 14.04
Reporter: yuangang.liu


I use word2vec.fit to train a Word2VecModel and then save the model to the file 
system. When I load the model back from the file system, I find I can use 
transform('a') to get a vector, but I can't use findSynonyms('a', 2) to get 
any words.
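
A minimal reproduction sketch of the report, assuming a running SparkContext sc and a writable path; the toy sentences and the path are made up:

{code}
from pyspark.mllib.feature import Word2Vec, Word2VecModel

# Illustrative reproduction: train, save, reload, then query the reloaded model.
sentences = sc.parallelize([["a", "b", "c"], ["a", "c", "d"]] * 100)
model = Word2Vec().setVectorSize(10).setMinCount(1).fit(sentences)
model.save(sc, "/tmp/w2v-model")

loaded = Word2VecModel.load(sc, "/tmp/w2v-model")
print(loaded.transform("a"))        # returns a vector as expected
print(loaded.findSynonyms("a", 2))  # reported to fail on the reloaded model
{code}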



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7122) KafkaUtils.createDirectStream - unreasonable processing time in absence of load

2015-11-26 Thread Nicolas PHUNG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028966#comment-15028966
 ] 

Nicolas PHUNG commented on SPARK-7122:
--

I'm confirming it works now in Spark 1.5.2

> KafkaUtils.createDirectStream - unreasonable processing time in absence of 
> load
> ---
>
> Key: SPARK-7122
> URL: https://issues.apache.org/jira/browse/SPARK-7122
> Project: Spark
>  Issue Type: Question
>  Components: Streaming
>Affects Versions: 1.3.1
> Environment: Spark Streaming 1.3.1, standalone mode running on just 1 
> box: Ubuntu 14.04.2 LTS, 4 cores, 8GB RAM, java version "1.8.0_40"
>Reporter: Platon Potapov
>Priority: Minor
> Attachments: 10.second.window.fast.job.txt, 
> 5.second.window.slow.job.txt, SparkStreamingJob.scala
>
>
> Attached is the complete source code of a test Spark job. No external data 
> generators are run - just the presence of a Kafka topic named "raw" suffices.
> The Spark job is run with no load whatsoever. http://localhost:4040/streaming 
> is checked to obtain the job processing duration.
> * in case the test contains the following transformation:
> {code}
> // dummy transformation
> val temperature = bytes.filter(_._1 == "abc")
> val abc = temperature.window(Seconds(40), Seconds(5))
> abc.print()
> {code}
> the median processing time is 3 seconds 80 ms
> * in case the test contains the following transformation:
> {code}
> // dummy transformation
> val temperature = bytes.filter(_._1 == "abc")
> val abc = temperature.map(x => (1, x))
> abc.print()
> {code}
> the median processing time is just 50 ms
> please explain why the "window" transformation introduces such a growth in 
> job duration?
> note: the result is the same regardless of the number of kafka topic 
> partitions (I've tried 1 and 8)
> note2: the result is the same regardless of the window parameters (I've tried 
> (20, 2) and (40, 5))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11987:
--
Description: Add Python APIs for QuantileDiscretizer and ChiSqSelector in 
the ML package.  (was: Add Python APIs for QuantileDiscretizer and 
ChiSqSelector in the ML package. ~~Then add Python APIs to the programming 
guide.~~)

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11987:
--
Description: Add Python APIs for QuantileDiscretizer and ChiSqSelector in 
the ML package. ~~Then add Python APIs to the programming guide.~~  (was: Add 
Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package. Then 
add Python APIs to the programming guide.)

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package. 
> ~~Then add Python APIs to the programming guide.~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11987:
--
Summary: Python API update for ChiSqSelector and QuantileDiscretizer  (was: 
Python API/Programming guide update for ChiSqSelector and QuantileDiscretizer)

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package. 
> Then add Python APIs to the programming guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029011#comment-15029011
 ] 

Xusen Yin commented on SPARK-11987:
---

I removed the API doc task. Let's postpone it until before the v1.7 release.

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11987:


Assignee: (was: Apache Spark)

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11987:


Assignee: Apache Spark

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Assignee: Apache Spark
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11987) Python API update for ChiSqSelector and QuantileDiscretizer

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029024#comment-15029024
 ] 

Apache Spark commented on SPARK-11987:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10007

> Python API update for ChiSqSelector and QuantileDiscretizer
> ---
>
> Key: SPARK-11987
> URL: https://issues.apache.org/jira/browse/SPARK-11987
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add Python APIs for QuantileDiscretizer and ChiSqSelector in the ML package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12002) offsetRanges attribute missing in Kafka RDD when resuming from checkpoint

2015-11-26 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029058#comment-15029058
 ] 

Cody Koeninger commented on SPARK-12002:


Just to be clear, is this only an issue with Python?

> offsetRanges attribute missing in Kafka RDD when resuming from checkpoint
> -
>
> Key: SPARK-12002
> URL: https://issues.apache.org/jira/browse/SPARK-12002
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Reporter: Amit Ramesh
>
> SPARK-8389 added offsetRanges to Kafka direct streams. And SPARK-10122 fixed 
> the issue of not ending up with non-Kafka RDDs when chaining transforms to 
> Kafka RDDs. It appears that this issue remains for the case where a streaming 
> application using Kafka direct streams is initialized from the checkpoint 
> directory. The following is a representative example where everything works 
> as expected during the first run, but exceptions are thrown on a subsequent 
> run when the context is being initialized from the checkpoint directory.
> {code:title=test_checkpoint.py|language=python}
> from pyspark import SparkContext
> from pyspark.streaming import StreamingContext
> from pyspark.streaming.kafka import KafkaUtils
>
> def attach_kafka_metadata(kafka_rdd):
>     offset_ranges = kafka_rdd.offsetRanges()
>     return kafka_rdd
>
> def create_context():
>     sc = SparkContext(appName='kafka-test')
>     ssc = StreamingContext(sc, 10)
>     ssc.checkpoint(CHECKPOINT_URI)
>     kafka_stream = KafkaUtils.createDirectStream(
>         ssc,
>         [TOPIC],
>         kafkaParams={
>             'metadata.broker.list': BROKERS,
>         },
>     )
>     kafka_stream.transform(attach_kafka_metadata).count().pprint()
>     return ssc
>
> if __name__ == "__main__":
>     ssc = StreamingContext.getOrCreate(CHECKPOINT_URI, create_context)
>     ssc.start()

[jira] [Commented] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-11-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029073#comment-15029073
 ] 

yuhao yang commented on SPARK-12000:


Met that with "./build/sbt unidoc" 

no-symbol does not have an owner
at scala.reflect.internal.SymbolTable.abort(SymbolTable.scala:49)
at scala.tools.nsc.Global.abort(Global.scala:254)
at scala.reflect.internal.Symbols$NoSymbol.owner(Symbols.scala:3257)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.addEnclosingTParams(ClassfileParser.scala:585)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:530)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:88)
at 
scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:261)
at 
scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:194)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.cleanupBogusClasses$1(MemberLookupBase.scala:153)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInTemplate(MemberLookupBase.scala:164)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.scala$tools$nsc$doc$base$MemberLookupBase$$lookupInTemplate(MemberLookupBase.scala:128)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInRootPackage(MemberLookupBase.scala:115)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.memberLookup(MemberLookupBase.scala:52)
at 
scala.tools.nsc.doc.DocFactory$$anon$1.memberLookup(DocFactory.scala:78)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link$lzycompute(MemberLookupBase.scala:27)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link(MemberLookupBase.scala:27)
at scala.tools.nsc.doc.base.comment.EntityLink$.unapply(Body.scala:75)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:126)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:124)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at scala.tools.nsc.doc.html.HtmlPage.blockToHtml(HtmlPage.scala:89)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$bodyToHtml$1.apply(Ht

[jira] [Commented] (SPARK-12013) Add a Hive context method to retrieve the database Names

2015-11-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029087#comment-15029087
 ] 

Xiao Li commented on SPARK-12013:
-

I can work on it. Thanks!

> Add a Hive context method to retrieve the database Names
> 
>
> Key: SPARK-12013
> URL: https://issues.apache.org/jira/browse/SPARK-12013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jose Antonio
>Priority: Minor
>  Labels: features, sql
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> As with the HiveContext method tableNames(), add a new method to retrieve the 
> database names.
> Currently one has to run SHOW DATABASES and call dataframe.collect().
> This is very slow, whereas the answer could be obtained quickly just by asking 
> the Hive metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-11-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029073#comment-15029073
 ] 

yuhao yang edited comment on SPARK-12000 at 11/26/15 4:43 PM:
--

Met it with "./build/sbt unidoc" 

This is blocking:
SPARK-11602 (Scala API, docs)
SPARK-11605 (ML 1.6 QA: API: Java compatibility, docs)

Is there a way to generate docs for MLlib successfully right now? Thanks.



was (Author: yuhaoyan):
Met that with "./build/sbt unidoc" 

This is blocking 
11602 scala API, doc
11605 ML 1.6 QA: API: Java compatibility, docs

Is there a way to generate docs for MLlib successfully now ? Thanks.


> `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
> -
>
> Key: SPARK-12000
> URL: https://issues.apache.org/jira/browse/SPARK-12000
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Reported by [~josephkb]. Not sure what is the root cause, but this is the 
> error message when I ran "sbt publishLocal":
> {code}
> [error] (launcher/compile:doc) javadoc returned nonzero exit code
> [error] (mllib/compile:doc) scala.reflect.internal.FatalError:
> [error]  while compiling: 
> /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala
> [error] during phase: global=terminal, atPhase=parser
> [error]  library version: version 2.10.5
> [error] compiler version: version 2.10.5
> [error]   reconstructed args: -Yno-self-type-checks -groups -classpath 
> /Users/meng/src/spark/core/target/scala-2.10/classes:/Users/meng/src/spark/launcher/target/scala-2.10/classes:/Users/meng/src/spark/network/common/target/scala-2.10/classes:/Users/meng/src/spark/network/shuffle/target/scala-2.10/classes:/Users/meng/src/spark/unsafe/target/scala-2.10/classes:/Users/meng/src/spark/streaming/target/scala-2.10/classes:/Users/meng/src/spark/sql/core/target/scala-2.10/classes:/Users/meng/src/spark/sql/catalyst/target/scala-2.10/classes:/Users/meng/src/spark/graphx/target/scala-2.10/classes:/Users/meng/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:/Users/meng/.ivy2/cache/com.google.guava/guava/bundles/guava-14.0.1.jar:/Users/meng/.ivy2/cache/io.netty/netty-all/jars/netty-all-4.0.29.Final.jar:/Users/meng/.ivy2/cache/org.fusesource.leveldbjni/leveldbjni-all/bundles/leveldbjni-all-1.8.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-databind/bundles/jackson-databind-2.4.4.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-annotations/bundles/jackson-annotations-2.4.4.jar:/Users/meng/.ivy2/cache/com.fasterxml.jackson.core/jackson-core/bundles/jackson-core-2.4.4.jar:/Users/meng/.ivy2/cache/com.twitter/chill_2.10/jars/chill_2.10-0.5.0.jar:/Users/meng/.ivy2/cache/com.twitter/chill-java/jars/chill-java-0.5.0.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.reflectasm/reflectasm/jars/reflectasm-1.07-shaded.jar:/Users/meng/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:/Users/meng/.ivy2/cache/org.objenesis/objenesis/jars/objenesis-1.2.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-mapred/jars/avro-mapred-1.7.7-hadoop2.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-ipc/jars/avro-ipc-1.7.7-tests.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro-ipc/jars/avro-ipc-1.7.7.jar:/Users/meng/.ivy2/cache/org.apache.avro/avro/jars/avro-1.7.7.jar:/Users/meng/.ivy2/cache/org.codehaus.jackson/jackson-core-asl/jars/jackson-core-asl-1.9.13.jar:/Users/meng/.ivy2/cache/org.codehaus.jackson/jackson-mapper-asl/jars/jackson-mapper-asl-1.9.13.jar:/Users/meng/.ivy2/cache/org.apache.commons/commons-compress/jars/commons-compress-1.4.1.jar:/Users/meng/.ivy2/cache/org.tukaani/xz/jars/xz-1.0.jar:/Users/meng/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.10.jar:/Users/meng/.ivy2/cache/org.apache.xbean/xbean-asm5-shaded/bundles/xbean-asm5-shaded-4.4.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-client/jars/hadoop-client-2.2.0.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-common/jars/hadoop-common-2.2.0.jar:/Users/meng/.ivy2/cache/org.apache.hadoop/hadoop-annotations/jars/hadoop-annotations-2.2.0.jar:/Users/meng/.ivy2/cache/commons-cli/commons-cli/jars/commons-cli-1.2.jar:/Users/meng/.ivy2/cache/org.apache.commons/commons-math/jars/commons-math-2.1.jar:/Users/meng/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/Users/meng/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar:/Users/meng/.ivy2/cache/commons-net/commons-net/jars/commons-net-3.1.jar:/Users/meng/.ivy2/cache/log4j/log4j/bundles/log4j-1.2.17.jar:/Users/meng/.ivy2/cache/commons-lang/commons-la

[jira] [Comment Edited] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-11-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029073#comment-15029073
 ] 

yuhao yang edited comment on SPARK-12000 at 11/26/15 4:43 PM:
--

Met that with "./build/sbt unidoc" 

This is blocking:
SPARK-11602 (Scala API, docs)
SPARK-11605 (ML 1.6 QA: API: Java compatibility, docs)

Is there a way to generate docs for MLlib successfully right now? Thanks.



was (Author: yuhaoyan):
Met that with "./build/sbt unidoc" 

no-symbol does not have an owner
at scala.reflect.internal.SymbolTable.abort(SymbolTable.scala:49)
at scala.tools.nsc.Global.abort(Global.scala:254)
at scala.reflect.internal.Symbols$NoSymbol.owner(Symbols.scala:3257)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.addEnclosingTParams(ClassfileParser.scala:585)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:530)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:88)
at 
scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:261)
at 
scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:194)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.cleanupBogusClasses$1(MemberLookupBase.scala:153)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInTemplate(MemberLookupBase.scala:164)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.scala$tools$nsc$doc$base$MemberLookupBase$$lookupInTemplate(MemberLookupBase.scala:128)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInRootPackage(MemberLookupBase.scala:115)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.memberLookup(MemberLookupBase.scala:52)
at 
scala.tools.nsc.doc.DocFactory$$anon$1.memberLookup(DocFactory.scala:78)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link$lzycompute(MemberLookupBase.scala:27)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link(MemberLookupBase.scala:27)
at scala.tools.nsc.doc.base.comment.EntityLink$.unapply(Body.scala:75)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:126)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:124)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTr

[jira] [Commented] (SPARK-12013) Add a Hive context method to retrieve the database Names

2015-11-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029114#comment-15029114
 ] 

Xiao Li commented on SPARK-12013:
-

[~getaceres] Hive does not provide such an interface. Thus, the implementation 
would just do exactly the same things you did.

> Add a Hive context method to retrieve the database Names
> 
>
> Key: SPARK-12013
> URL: https://issues.apache.org/jira/browse/SPARK-12013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jose Antonio
>Priority: Minor
>  Labels: features, sql
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> As with the HiveContext method tableNames(), add a new method to retrieve the 
> database names.
> Currently one has to run SHOW DATABASES and call dataframe.collect().
> This is very slow, whereas the answer could be obtained quickly just by asking 
> the Hive metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12013) Add a Hive context method to retrieve the database Names

2015-11-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029114#comment-15029114
 ] 

Xiao Li edited comment on SPARK-12013 at 11/26/15 5:07 PM:
---

[~getaceres] Hive does not provide such an interface. Thus, the implementation 
would just do exactly the same things you did. Is that what you want?


was (Author: smilegator):
[~getaceres] Hive does not provide such an interface. Thus, the implementation 
would just do exactly the same things you did. 

> Add a Hive context method to retrieve the database Names
> 
>
> Key: SPARK-12013
> URL: https://issues.apache.org/jira/browse/SPARK-12013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jose Antonio
>Priority: Minor
>  Labels: features, sql
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> As with the HiveContext method tableNames(), add a new method to retrieve the 
> database names.
> Currently one has to run SHOW DATABASES and call dataframe.collect().
> This is very slow, whereas the answer could be obtained quickly just by asking 
> the Hive metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12016) word2vec load model can't use findSynonyms to get words

2015-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029129#comment-15029129
 ] 

Sean Owen commented on SPARK-12016:
---

This isn't nearly enough info. Please provide more detail like a reproduction 
or close this.

> word2vec load model can't use findSynonyms to get words 
> 
>
> Key: SPARK-12016
> URL: https://issues.apache.org/jira/browse/SPARK-12016
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: ubuntu 14.04
>Reporter: yuangang.liu
>
> I use word2vec.fit to train a Word2VecModel and then save the model to the file 
> system. When I load the model back from the file system, I find I can use 
> transform('a') to get a vector, but I can't use findSynonyms('a', 2) to get 
> any words.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.

2015-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029131#comment-15029131
 ] 

Sean Owen commented on SPARK-2572:
--

Try 1.6/master? I don't know about Mesos, myself.

> Can't delete local dir on executor automatically when running spark over 
> Mesos.
> ---
>
> Key: SPARK-2572
> URL: https://issues.apache.org/jira/browse/SPARK-2572
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Yadong Qi
>Priority: Minor
>
> When running Spark over Mesos in “fine-grained” or “coarse-grained” mode, 
> after the application finishes, the local 
> dir (/tmp/spark-local-20140718114058-834c) on the executor is not deleted 
> automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11990) DataFrame recompute UDF in some situation.

2015-11-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11990.
--
   Resolution: Duplicate
Fix Version/s: 1.6.0

This is already fixed in Spark 1.6 by [SPARK-10371].

> DataFrame recompute UDF in some situation.
> --
>
> Key: SPARK-11990
> URL: https://issues.apache.org/jira/browse/SPARK-11990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Yi Tian
> Fix For: 1.6.0
>
>
> Here is codes for reproducing this problem:
> {code}
>   val mkArrayUDF = org.apache.spark.sql.functions.udf[Array[String],String] 
> ((s: String) => {
> println("udf called")
> Array[String](s+"_part1", s+"_part2")
>   })
>   
>   val df = sc.parallelize(Seq(("a"))).toDF("a")
>   val df2 = df.withColumn("arr",mkArrayUDF(df("a")))
>   val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", 
> df2("arr")(1))
>   df3.collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12013) Add a Hive context method to retrieve the database Names

2015-11-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029140#comment-15029140
 ] 

Xiao Li commented on SPARK-12013:
-

I am not sure whether we should add such an interface. Let [~rxin] [~marmbrus] 
give a suggestion.

IMO, the simplest way is to run the following code. It returns a sequence; if 
needed, you can convert it to an array by calling toArray:
hiveContext.runSqlHive("SHOW DATABASES")




> Add a Hive context method to retrieve the database Names
> 
>
> Key: SPARK-12013
> URL: https://issues.apache.org/jira/browse/SPARK-12013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jose Antonio
>Priority: Minor
>  Labels: features, sql
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> As with the HiveContext method tableNames(), add a new method to retrieve the 
> database names.
> Currently one has to run SHOW DATABASES and call dataframe.collect().
> This is very slow, whereas the answer could be obtained quickly just by asking 
> the Hive metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12017) Java Doc Publishing Broken

2015-11-26 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12017:


 Summary: Java Doc Publishing Broken
 Key: SPARK-12017
 URL: https://issues.apache.org/jira/browse/SPARK-12017
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Michael Armbrust
Priority: Blocker


The java docs are missing from the 1.6 preview.  I think that 
[this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
 is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12017) Java Doc Publishing Broken

2015-11-26 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029153#comment-15029153
 ] 

Michael Armbrust commented on SPARK-12017:
--

/cc [~joshrosen]

> Java Doc Publishing Broken
> --
>
> Key: SPARK-12017
> URL: https://issues.apache.org/jira/browse/SPARK-12017
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Michael Armbrust
>Priority: Blocker
>
> The java docs are missing from the 1.6 preview.  I think that 
> [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
>  is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11373) Add metrics to the History Server and providers

2015-11-26 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11373:
---
Affects Version/s: (was: 1.5.1)
   1.6.0

> Add metrics to the History Server and providers
> ---
>
> Key: SPARK-11373
> URL: https://issues.apache.org/jira/browse/SPARK-11373
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>
> The History server doesn't publish metrics about JVM load or anything from 
> the history provider plugins. This means that performance problems from 
> massive job histories aren't visible to management tools, nor are any 
> provider-generated metrics such as time to load histories, failed history 
> loads, the number of connectivity failures talking to remote services, etc.
> If the history server set up a metrics registry and offered the option to 
> publish its metrics, then management tools could view this data.
> # the metrics registry would need to be passed down to the instantiated 
> {{ApplicationHistoryProvider}}, in order for it to register its metrics.
> # if the codahale metrics servlet were registered under a path such as 
> {{/metrics}}, the values would be visible as HTML and JSON, without the need 
> for management tools.
> # Integration tests could also retrieve the JSON-formatted data and use it as 
> part of the test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9137) Unified label verification for Classifier

2015-11-26 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029170#comment-15029170
 ] 

Maciej Szymkiewicz commented on SPARK-9137:
---

[~josephkb] Could you take a look at [this question on 
SO|http://stackoverflow.com/q/33708532/1560062]? Do you think there should be a 
separate JIRA for this or is it simply covered by this one?

> Unified label verification for Classifier
> -
>
> Key: SPARK-9137
> URL: https://issues.apache.org/jira/browse/SPARK-9137
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> We should check that labels are valid before training models for 
> ml.classification such as LogisticRegression, NaiveBayes, etc. We can make 
> this check in extractLabeledPoints. Some models do this check during the 
> training step at present, and we need to unify them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-11-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029183#comment-15029183
 ] 

Ted Yu commented on SPARK-11206:


Looks like SQLListenerMemoryLeakSuite fails on maven Jenkins now.
e.g.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/4251/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/org.apache.spark.sql.execution.ui/SQLListenerMemoryLeakSuite/no_memory_leak/

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>Assignee: Carson Wang
> Fix For: 1.7.0
>
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It would be helpful to support the SQL UI on the 
> history server so we can analyze queries even after execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11863) Unable to resolve order by if it contains mixture of aliases and real columns.

2015-11-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11863.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9961
[https://github.com/apache/spark/pull/9961]

> Unable to resolve order by if it contains mixture of aliases and real columns.
> --
>
> Key: SPARK-11863
> URL: https://issues.apache.org/jira/browse/SPARK-11863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
> Fix For: 1.6.0
>
>
> The analyzer is unable to resolve ORDER BY if the columns in the ORDER BY 
> clause contain a mixture of aliases and real column names.
> Example:
> var var3 = sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, 
> c2 order by  b, c1")
> This used to work in 1.4, fails starting with 1.5, and affects some TPC-DS 
> queries (19, 55, 71).
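
A self-contained sketch of the failure, assuming a PySpark shell; the table name inttab and columns c1, c2 follow the example above, while the sample rows are made up:

{code}
# Illustrative reproduction: ORDER BY mixes an alias (b) with a real column (c1).
df = sqlContext.createDataFrame([(1, 10), (2, 20), (1, 30)], ["c1", "c2"])
df.registerTempTable("inttab")
sqlContext.sql(
    "select c1 as a, c2 as b from inttab group by c1, c2 order by b, c1"
).show()  # resolves in 1.4; fails to resolve starting with 1.5
{code}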



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-11-26 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029226#comment-15029226
 ] 

Xiu(Joe) Guo commented on SPARK-6644:
-

With the current master branch code line (1.6.0-snapshot), this issue cannot be 
reproduced anymore.

{panel}
scala> sqlContext.sql("DROP TABLE IF EXISTS table_with_partition ")
res6: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS table_with_partition (key 
INT, value STRING) PARTITIONED BY (ds STRING)")
res7: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION 
(ds = '1') SELECT key, value FROM testData")
res8: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("select * from table_with_partition")
res9: org.apache.spark.sql.DataFrame = [key: int, value: string, ds: string]

scala> sqlContext.sql("select * from table_with_partition").show
+---+-----+---+
|key|value| ds|
+---+-----+---+
|  1|    1|  1|
|  2|    2|  1|
+---+-----+---+

scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 
STRING)")
res11: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng 
DOUBLE)") 
res12: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION 
(ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
res13: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("SELECT * FROM table_with_partition").show
+---+-----+----+-------+---+
|key|value|key1|destlng| ds|
+---+-----+----+-------+---+
|  1|    1|test|   1.11|  1|
|  2|    2|test|   1.11|  1|
+---+-----+----+-------+---+
{panel}

> After adding new columns to a partitioned table and inserting data to an old 
> partition, data of newly added columns are all NULL
> 
>
> Key: SPARK-6644
> URL: https://issues.apache.org/jira/browse/SPARK-6644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: dongxu
>
> In Hive, the schema of a partition may differ from the table schema. For 
> example, we may add new columns to the table after importing existing 
> partitions. When using {{spark-sql}} to query the data in a partition whose 
> schema is different from the table schema, problems may arise. Some of them 
> have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
> However, after adding new column(s) to the table, when inserting data into 
> old partitions, values of newly added columns are all {{NULL}}.
> The following snippet can be used to reproduce this issue:
> {code}
> case class TestData(key: Int, value: String)
> val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => 
> TestData(i, i.toString))).toDF()
> testData.registerTempTable("testData")
> sql("DROP TABLE IF EXISTS table_with_partition ")
> sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) 
> PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
> key, value FROM testData")
> // Add new columns to the table
> sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
> sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") 
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
> key, value, 'test', 1.11 FROM testData")
> sql("SELECT * FROM table_with_partition WHERE ds = 
> '1'").collect().foreach(println)
> {code}
> Actual result:
> {noformat}
> [1,1,null,null,1]
> [2,2,null,null,1]
> {noformat}
> Expected result:
> {noformat}
> [1,1,test,1.11,1]
> [2,2,test,1.11,1]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11863) Unable to resolve order by if it contains mixture of aliases and real columns.

2015-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11863:
--
Assignee: Wenchen Fan

> Unable to resolve order by if it contains mixture of aliases and real columns.
> --
>
> Key: SPARK-11863
> URL: https://issues.apache.org/jira/browse/SPARK-11863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> The analyzer is unable to resolve ORDER BY if the columns in the ORDER BY 
> clause contain a mixture of aliases and real column names.
> Example:
> var var3 = sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, 
> c2 order by  b, c1")
> This used to work in 1.4, fails starting with 1.5, and affects some TPC-DS 
> queries (19, 55, 71).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12015) Auto convert int to Double when required in pyspark.ml

2015-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12015:
--
Fix Version/s: (was: 1.6.0)

[~jamartinh] please don't set the Fix Version field.

> Auto convert int to Double when required in pyspark.ml
> --
>
> Key: SPARK-12015
> URL: https://issues.apache.org/jira/browse/SPARK-12015
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2
>Reporter: Jose Antonio
>  Labels: ml, py4j, pyspark, type-converter
>
> I have received the following exception:
> Why can we not pass an integer when the required parameter is a Double?
> I think the integer should be silently upcast to a double.
> java.lang.IllegalArgumentException: requirement failed: Column label must be 
> of type DoubleType but was actually IntegerType.
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>   at 
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
>   at 
> org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11707) StreamCorruptedException if authentication is enabled

2015-11-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029293#comment-15029293
 ] 

Reynold Xin commented on SPARK-11707:
-

cc [~vanzin] do you want to look into it?


> StreamCorruptedException if authentication is enabled
> -
>
> Key: SPARK-11707
> URL: https://issues.apache.org/jira/browse/SPARK-11707
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> When authentication (and encryption) is enabled (at least in standalone 
> mode), the following code (in Spark shell):
> {code}
> sc.makeRDD(1 to 10, 10).map(x => x*x).map(_.toString).reduce(_ + _)
> {code}
> finishes with exception:
> {noformat}
> [Stage 0:> (0 + 8) / 
> 10]15/11/12 20:36:29 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() on RPC id 5750598674048943239
> java.io.StreamCorruptedException: invalid type code: 30
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2508)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2543)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2702)
>   at java.io.ObjectInputStream.read(ObjectInputStream.java:865)
>   at 
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$readObject$1.apply(SerializableBuffer.scala:38)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$readObject$1.apply(SerializableBuffer.scala:32)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1186)
>   at 
> org.apache.spark.util.SerializableBuffer.readObject(SerializableBuffer.scala:32)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:248)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:296)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:247)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:448)
>   at 
> org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:76)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:122)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:94)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.ha

[jira] [Commented] (SPARK-11988) Update JPMML to 1.2.7

2015-11-26 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029303#comment-15029303
 ] 

Vincenzo Selvaggio commented on SPARK-11988:


[~srowen] [~vfed]
We did discuss this on my pull request when we first introduced the PMML 
support:
https://github.com/apache/spark/pull/3062#issuecomment-94655050

I was asked by [~mengxr] to make it Java 6 compatible; I had used 1.2 to start 
with and changed back to 1.1.
If Spark doesn't support Java 6 anymore, it would be good to update to 1.2.

Thanks

> Update JPMML to 1.2.7
> -
>
> Key: SPARK-11988
> URL: https://issues.apache.org/jira/browse/SPARK-11988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Pretty much says it all: update to the more recent JPMML 1.2.x line from 
> 1.1.x. 
> The API did change between the 2 versions, though all use of it is internal 
> to Spark, so doesn't affect end users in theory.
> In practice, JPMML leaks into the classpath and might cause issues. My 
> rationale for the change is that 1.2.x is more recent and more stable (right 
> [~vfed]?) so if people are going to run into it, might as well run into 
> something more modern.
> And bug fixes and more support and all that.
> CC [~selvinsource]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-26 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029304#comment-15029304
 ] 

Radoslaw Gruchalski commented on SPARK-11638:
-

I have added a change to enable {{spark.driver.advertisedPort}} support in 
{{NettyRpcEnv}}: 
https://github.com/radekg/spark/commit/b21aae1468169ee0a388d33ba6cebdb17b895956#diff-0c89b4a60c30a7cd2224bb64d93da942R125
I'm not quite sure what the impact of this change is; it would be great if 
somebody could cross-check.
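
To make the intended usage concrete, a rough sketch of how the proposed setting would be used (the {{advertisedPort}} key comes from the attached patches and is not part of stock Spark; the hostname and port numbers are placeholders):
{code}
// Sketch only: spark.driver.advertisedPort is a proposed setting from the attached
// patches, not stock Spark configuration. Hostname and ports are placeholders.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.host", "agent-host.example.com")  // address reachable by executors
  .set("spark.driver.port", "7001")                    // port bound inside the container
  .set("spark.driver.advertisedPort", "31001")         // Mesos/Marathon-assigned host port
{code}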

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, the Mesos Master can't reach that address; it is on a 
> different host, a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box Spark does not allow advertising its 
> services on ports other than the ones it binds to. Consider the following 
> scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos assigns 4 ports in the {{31000-32000}} range, mapped to the 
> container ports. Starting the executors from such a container leaves them 
> unable to communicate back to the Spark Master.
> This happens for two reasons:
> The Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. Prior to version {{2.4}}, {{akka-remote}} cannot advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors use to contact the Spark Master are prepared by the 
> Master and handed over to the executors. These always contain the port number 
> on which the Master exposes the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be 
> specified via Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the

[jira] [Created] (SPARK-12018) Refactor common subexpression elimination code

2015-11-26 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-12018:
---

 Summary: Refactor common subexpression elimination code
 Key: SPARK-12018
 URL: https://issues.apache.org/jira/browse/SPARK-12018
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-26 Thread Radoslaw Gruchalski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radoslaw Gruchalski updated SPARK-11638:

Attachment: (was: 1.6.0-master.patch)

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, the Mesos Master can't reach that address; it is on a 
> different host, a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box Spark does not allow advertising its 
> services on ports other than the ones it binds to. Consider the following 
> scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos assigns 4 ports in the {{31000-32000}} range, mapped to the 
> container ports. Starting the executors from such a container leaves them 
> unable to communicate back to the Spark Master.
> This happens for two reasons:
> The Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. Prior to version {{2.4}}, {{akka-remote}} cannot advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors use to contact the Spark Master are prepared by the 
> Master and handed over to the executors. These always contain the port number 
> on which the Master exposes the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be 
> specified via Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.netty.tcp.bind-hostname}} and 
> {{akka.remote.netty.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the 
> {{2.3.x}} versions. These patches are attached to this ticket - 
> {{2.3.4.patch}} and {{2.3.11.patch}} files provide patche

[jira] [Updated] (SPARK-12018) Refactor common subexpression elimination code

2015-11-26 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-12018:

Description: 
The code of common subexpression elimination can be factored and simplified. 
Some unnecessary variables can be removed.


> Refactor common subexpression elimination code
> --
>
> Key: SPARK-12018
> URL: https://issues.apache.org/jira/browse/SPARK-12018
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> The code of common subexpression elimination can be factored and simplified. 
> Some unnecessary variables can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12018) Refactor common subexpression elimination code

2015-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029308#comment-15029308
 ] 

Apache Spark commented on SPARK-12018:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10009

> Refactor common subexpression elimination code
> --
>
> Key: SPARK-12018
> URL: https://issues.apache.org/jira/browse/SPARK-12018
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-26 Thread Radoslaw Gruchalski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radoslaw Gruchalski updated SPARK-11638:

Attachment: 1.6.0.patch

Updated 1.6.0 patch to include NettyRpcEnv.

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, the Mesos Master can't reach that address; it is on a 
> different host, a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box Spark does not allow advertising its 
> services on ports other than the ones it binds to. Consider the following 
> scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos assigns 4 ports in the {{31000-32000}} range, mapped to the 
> container ports. Starting the executors from such a container leaves them 
> unable to communicate back to the Spark Master.
> This happens for two reasons:
> The Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. Prior to version {{2.4}}, {{akka-remote}} cannot advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors use to contact the Spark Master are prepared by the 
> Master and handed over to the executors. These always contain the port number 
> on which the Master exposes the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be 
> specified via Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.netty.tcp.bind-hostname}} and 
> {{akka.remote.netty.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the 
> {{2.3.x}} versions. These patches are attached to this ticket - 
> {{2.3.4.patch}} a

[jira] [Assigned] (SPARK-12018) Refactor common subexpression elimination code

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12018:


Assignee: Apache Spark

> Refactor common subexpression elimination code
> --
>
> Key: SPARK-12018
> URL: https://issues.apache.org/jira/browse/SPARK-12018
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12018) Refactor common subexpression elimination code

2015-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12018:


Assignee: (was: Apache Spark)

> Refactor common subexpression elimination code
> --
>
> Key: SPARK-12018
> URL: https://issues.apache.org/jira/browse/SPARK-12018
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11988) Update JPMML to 1.2.7

2015-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029310#comment-15029310
 ] 

Sean Owen commented on SPARK-11988:
---

Yeah Java 7 is the minimum now.

> Update JPMML to 1.2.7
> -
>
> Key: SPARK-11988
> URL: https://issues.apache.org/jira/browse/SPARK-11988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Pretty much says it all: update to the more recent JPMML 1.2.x line from 
> 1.1.x. 
> The API did change between the 2 versions, though all use of it is internal 
> to Spark, so doesn't affect end users in theory.
> In practice, JPMML leaks into the classpath and might cause issues. My 
> rationale for the change is that 1.2.x is more recent and more stable (right 
> [~vfed]?) so if people are going to run into it, might as well run into 
> something more modern.
> And bug fixes and more support and all that.
> CC [~selvinsource]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11988) Update JPMML to 1.2.7

2015-11-26 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029313#comment-15029313
 ] 

Vincenzo Selvaggio commented on SPARK-11988:


[~vfed] I have a related question: if we do update to 1.2, could you ensure the 
version attribute is added to the root node, so we don't have to hard-code the 
version? See this recent pull request:
https://github.com/apache/spark/pull/9558/files


> Update JPMML to 1.2.7
> -
>
> Key: SPARK-11988
> URL: https://issues.apache.org/jira/browse/SPARK-11988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Pretty much says it all: update to the more recent JPMML 1.2.x line from 
> 1.1.x. 
> The API did change between the 2 versions, though all use of it is internal 
> to Spark, so doesn't affect end users in theory.
> In practice, JPMML leaks into the classpath and might cause issues. My 
> rationale for the change is that 1.2.x is more recent and more stable (right 
> [~vfed]?) so if people are going to run into it, might as well run into 
> something more modern.
> And bug fixes and more support and all that.
> CC [~selvinsource]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029318#comment-15029318
 ] 

Karl Higley commented on SPARK-5992:


I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding vectors. The 
two "types" of LSH mentioned here seem more like two kinds of operations which 
are sometimes applied sequentially. Maybe this distinction makes more sense for 
other types of LSH?
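
For concreteness, a small sketch (plain Scala, not tied to any Spark API; hyperplanes drawn from a Gaussian) of the sign-random-projection hashing and Hamming-based cosine estimate described above:
{code}
import scala.util.Random

// Sketch: dense vectors as Array[Double]; one random hyperplane per signature bit.
def randomHyperplanes(numBits: Int, dim: Int, seed: Long): Array[Array[Double]] = {
  val rng = new Random(seed)
  Array.fill(numBits, dim)(rng.nextGaussian())
}

// The "hash value": one bit per hyperplane, the sign of the dot product.
def signature(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
  planes.map(p => p.zip(v).map { case (a, b) => a * b }.sum >= 0.0)

// The Hamming distance between two signatures estimates the angle between the
// vectors, which in turn gives an estimate of their cosine similarity.
def estimatedCosine(s1: Array[Boolean], s2: Array[Boolean]): Double = {
  val hamming = s1.zip(s2).count { case (a, b) => a != b }
  math.cos(math.Pi * hamming / s1.length)
}
{code}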

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029318#comment-15029318
 ] 

Karl Higley edited comment on SPARK-5992 at 11/26/15 11:20 PM:
---

I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding feature 
vectors. The two "types" of LSH mentioned here seem more like two kinds of 
operations which are sometimes applied sequentially. Maybe this distinction 
makes more sense for other types of LSH?


was (Author: karlhigley):
I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding vectors. The 
two "types" of LSH mentioned here seem more like two kinds of operations which 
are sometimes applied sequentially. Maybe this distinction makes more sense for 
other types of LSH?

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11988) Update JPMML to 1.2.7

2015-11-26 Thread Villu Ruusmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029319#comment-15029319
 ] 

Villu Ruusmann commented on SPARK-11988:


PMML@version is a required attribute. Your IDE should bring it to your 
attention every time you instantiate a new org.dmg.pmml.PMML object - always 
choose an overloaded constructor over the default (i.e. no-arg) constructor.

I would recommend setting up a utility class that takes care of initializing 
the basic document structure. For example, I have a general-purpose 
JPMML-Converter project that I include in the specialized JPMML-R and 
JPMML-SkLearn converter projects.
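
To illustrate, a minimal sketch assuming the JPMML-Model 1.2.x overloaded constructor {{PMML(version, header, dataDictionary)}}; the version string is just an example:
{code}
import org.dmg.pmml.{DataDictionary, Header, PMML}

// Sketch: prefer the overloaded constructor so the required version attribute is
// always set; "4.2" is an example schema version for the JPMML-Model 1.2.x line.
val pmml = new PMML("4.2", new Header(), new DataDictionary())
// ...populate the header, data dictionary and models before marshalling...
{code}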

> Update JPMML to 1.2.7
> -
>
> Key: SPARK-11988
> URL: https://issues.apache.org/jira/browse/SPARK-11988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Pretty much says it all: update to the more recent JPMML 1.2.x line from 
> 1.1.x. 
> The API did change between the 2 versions, though all use of it is internal 
> to Spark, so doesn't affect end users in theory.
> In practice, JPMML leaks into the classpath and might cause issues. My 
> rationale for the change is that 1.2.x is more recent and more stable (right 
> [~vfed]?) so if people are going to run into it, might as well run into 
> something more modern.
> And bug fixes and more support and all that.
> CC [~selvinsource]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11988) Update JPMML to 1.2.7

2015-11-26 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029320#comment-15029320
 ] 

Vincenzo Selvaggio commented on SPARK-11988:


OK thanks!

> Update JPMML to 1.2.7
> -
>
> Key: SPARK-11988
> URL: https://issues.apache.org/jira/browse/SPARK-11988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Pretty much says it all: update to the more recent JPMML 1.2.x line from 
> 1.1.x. 
> The API did change between the 2 versions, though all use of it is internal 
> to Spark, so doesn't affect end users in theory.
> In practice, JPMML leaks into the classpath and might cause issues. My 
> rationale for the change is that 1.2.x is more recent and more stable (right 
> [~vfed]?) so if people are going to run into it, might as well run into 
> something more modern.
> And bug fixes and more support and all that.
> CC [~selvinsource]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


