[jira] [Commented] (SPARK-3950) Completed time is blank for some successful tasks

2015-10-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960220#comment-14960220
 ] 

Jean-Baptiste Onofré commented on SPARK-3950:
-

I can't reproduce the issue. On 1.6.0-SNAPSHOT, thanks to 
getFormattedTimeQuantiles(), task durations are expressed in ms when 
required, like GC time.

I think this issue can be closed.

> Completed time is blank for some successful tasks
> -
>
> Key: SPARK-3950
> URL: https://issues.apache.org/jira/browse/SPARK-3950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1
>Reporter: Aaron Davidson
>
> In the Spark web UI, some tasks appear to have a blank Duration column. It's 
> possible that these ran for <.5 seconds, but if so, we should use 
> milliseconds like we do for GC time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10754) table and column name are case sensitive when json Dataframe was registered as tempTable using JavaSparkContext.

2015-10-15 Thread Babulal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960205#comment-14960205
 ] 

Babulal commented on SPARK-10754:
-

Thank you, Huaxin Gao, for the reply.

I checked with the "spark.sql.caseSensitive=false" option and it is working fine.

Can we either make it default to false or document it (as you suggested)?

I guess it is defined in SQLConf.scala:


  val DIALECT = "spark.sql.dialect"
  val CASE_SENSITIVE = "spark.sql.caseSensitive"

  /**
   * caseSensitive analysis, true by default
   */
  def caseSensitiveAnalysis: Boolean =
    getConf(SQLConf.CASE_SENSITIVE, "true").toBoolean

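For reference, a minimal sketch of setting the option on a SQLContext before 
querying (assuming the Scala API; the Java API is analogous):

{code}
// relax case sensitivity for table and column name resolution in this SQLContext
sqlContext.setConf("spark.sql.caseSensitive", "false")

sqlContext.sql("select * from sparktable").show()
sqlContext.sql("select * from sparkTable").show()  // now resolves as well
{code}
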
> table and column name are case sensitive when json Dataframe was registered 
> as tempTable using JavaSparkContext. 
> -
>
> Key: SPARK-10754
> URL: https://issues.apache.org/jira/browse/SPARK-10754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.1
> Environment: Linux ,Hadoop Version 1.3
>Reporter: Babulal
>
> Create a dataframe using json data source 
>   SparkConf conf = new 
> SparkConf().setMaster("spark://xyz:7077").setAppName("Spark Table");
>   JavaSparkContext javacontext=new JavaSparkContext(conf);
>   SQLContext sqlContext=new SQLContext(javacontext);
>   
>   DataFrame df = 
> sqlContext.jsonFile("/user/root/examples/src/main/resources/people.json");
>   
>   df.registerTempTable("sparktable");
>   
>   Run the queries:
>   
>   sqlContext.sql("select * from sparktable").show()  // this will PASS
>   
>   sqlContext.sql("select * from sparkTable").show()  // this will FAIL
>   
>   java.lang.RuntimeException: Table Not Found: sparkTable
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:58)
> at 
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:233)
>   
>   
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11120:


Assignee: (was: Apache Spark)

> maxNumExecutorFailures defaults to 3 under dynamic allocation
> -
>
> Key: SPARK-11120
> URL: https://issues.apache.org/jira/browse/SPARK-11120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> With dynamic allocation, the {{spark.executor.instances}} config is 0, 
> meaning [this 
> line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68]
>  ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has 
> resulted in large dynamicAllocation jobs with hundreds of executors dying due 
> to one bad node serially failing executors that are allocated on it.
> I think that using {{spark.dynamicAllocation.maxExecutors}} would make most 
> sense in this case; I frequently run shells that vary between 1 and 1000 
> executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would 
> still leave me with a value that is lower than makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11120:


Assignee: Apache Spark

> maxNumExecutorFailures defaults to 3 under dynamic allocation
> -
>
> Key: SPARK-11120
> URL: https://issues.apache.org/jira/browse/SPARK-11120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Assignee: Apache Spark
>Priority: Minor
>
> With dynamic allocation, the {{spark.executor.instances}} config is 0, 
> meaning [this 
> line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68]
>  ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has 
> resulted in large dynamicAllocation jobs with hundreds of executors dying due 
> to one bad node serially failing executors that are allocated on it.
> I think that using {{spark.dynamicAllocation.maxExecutors}} would make most 
> sense in this case; I frequently run shells that vary between 1 and 1000 
> executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would 
> still leave me with a value that is lower than makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960174#comment-14960174
 ] 

Apache Spark commented on SPARK-11120:
--

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/9147

> maxNumExecutorFailures defaults to 3 under dynamic allocation
> -
>
> Key: SPARK-11120
> URL: https://issues.apache.org/jira/browse/SPARK-11120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> With dynamic allocation, the {{spark.executor.instances}} config is 0, 
> meaning [this 
> line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68]
>  ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has 
> resulted in large dynamicAllocation jobs with hundreds of executors dying due 
> to one bad node serially failing executors that are allocated on it.
> I think that using {{spark.dynamicAllocation.maxExecutors}} would make most 
> sense in this case; I frequently run shells that vary between 1 and 1000 
> executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would 
> still leave me with a value that is lower than makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation

2015-10-15 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960169#comment-14960169
 ] 

Ryan Williams commented on SPARK-11120:
---

Without dynamic allocation, you are allowed [twice the number of executors] 
failures, which seems reasonable.

With dynamic allocation, {{spark.executor.instances}} doesn't get set, and so 
you are allowed {{math.max(0 * 2, 3)}} failures, no matter how many executors 
your job has as its min, initial, and max settings.
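
A paraphrased sketch of the effective computation at the linked ApplicationMaster 
line (not the exact source, just how the logic plays out under dynamic allocation):

{code}
// spark.executor.instances defaults to 0 when dynamic allocation is enabled,
// so the fallback below collapses to math.max(0 * 2, 3) == 3
val executorInstances = sparkConf.getInt("spark.executor.instances", 0)
val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
  math.max(executorInstances * 2, 3))
{code}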


> maxNumExecutorFailures defaults to 3 under dynamic allocation
> -
>
> Key: SPARK-11120
> URL: https://issues.apache.org/jira/browse/SPARK-11120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> With dynamic allocation, the {{spark.executor.instances}} config is 0, 
> meaning [this 
> line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68]
>  ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has 
> resulted in large dynamicAllocation jobs with hundreds of executors dying due 
> to one bad node serially failing executors that are allocated on it.
> I think that using {{spark.dynamicAllocation.maxExecutors}} would make most 
> sense in this case; I frequently run shells that vary between 1 and 1000 
> executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would 
> still leave me with a value that is lower than makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: Move predictNodeIndex to LearningNode

2015-10-15 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960164#comment-14960164
 ] 

Luvsandondov Lkhamsuren commented on SPARK-9963:


Please let me know if it needs an additional fix.

Thanks

> ML RandomForest cleanup: Move predictNodeIndex to LearningNode
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> (updated from the original description)
> Move ml.tree.impl.RandomForest.predictNodeIndex to LearningNode.
> We need to keep it as a separate method from Node.predictImpl because (a) it 
> needs to operate on binned features and (b) it needs to return the node ID, 
> not the node (because it can return the ID for nodes which do not yet exist).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11136) Warm-start support for ML estimator

2015-10-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11136:
--
Description: 
The current implementation of Estimator does not support warm-start fitting, 
i.e. estimator.fit(data, params, partialModel). We first need to add 
warm-start support to the individual ML estimators; this is an umbrella JIRA 
to track that work.

Treat the model as a special parameter, passing it through ParamMap, e.g. val 
partialModel: Param[Option[M]] = new Param(...). If a model is supplied, we use 
it to warm-start; otherwise we start the training process from the beginning.
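
A minimal sketch of the Param-based approach above (illustrative names only, not 
a committed API):

{code}
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.{Param, Params}

// hypothetical mixin sketching "model as a special parameter"
trait HasPartialModel[M <: Model[M]] extends Params {
  final val partialModel: Param[Option[M]] = new Param[Option[M]](this, "partialModel",
    "optional model used to warm-start the fitting process")
  setDefault(partialModel, None)
  final def getPartialModel: Option[M] = $(partialModel)
}

// inside fit(): $(partialModel) match {
//   case Some(m) => ... // continue training from m
//   case None    => ... // train from scratch
// }
{code}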



  was:
The current implementation of Estimator does not support warm-start fitting, 
i.e. estimator.fit(data, params, partialModel). But first we need to add 
warm-start for all ML estimators. This is an umbrella JIRA to add support for 
the warm-start estimator. 

Possible solutions:

1. Add warm-start fitting interface like def fit(dataset: DataFrame, initModel: 
M, paramMap: ParamMap): M

2. Treat model as a special parameter, passing it through ParamMap. e.g. val 
partialModel: Param[Option[M]] = new Param(...). In the case of model existing, 
we use it to warm-start, else we start the training process from the beginning.




> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). We first need to add 
> warm-start support to the individual ML estimators; this is an umbrella JIRA 
> to track that work.
> Treat the model as a special parameter, passing it through ParamMap, e.g. val 
> partialModel: Param[Option[M]] = new Param(...). If a model is supplied, we 
> use it to warm-start; otherwise we start the training process from the 
> beginning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker

2015-10-15 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960119#comment-14960119
 ] 

Klaus Ma commented on SPARK-11143:
--

I also got feedback on StackOverflow at 
http://stackoverflow.com/questions/33160859/how-to-enable-spark-mesos-docker-executor, 
but I think it's still worth enhancing Spark to work with a plain ubuntu image, 
because the special image in that solution only sets the work directory :(.

> SparkMesosDispatcher can not launch driver in docker
> 
>
> Key: SPARK-11143
> URL: https://issues.apache.org/jira/browse/SPARK-11143
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
> Environment: Ubuntu 14.04
>Reporter: Klaus Ma
>
> I'm working on integration between Mesos & Spark. For now, I can start 
> SlaveMesosDispatcher in a docker container, and I would like to also run the 
> Spark executor in a Mesos docker container. I use the following configuration 
> for it, but I get an error; any suggestion?
> Configuration:
> Spark: conf/spark-defaults.conf
> {code}
> spark.mesos.executor.docker.image        ubuntu
> spark.mesos.executor.docker.volumes  
> /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
> spark.mesos.executor.home                /root/spark
> #spark.executorEnv.SPARK_HOME            /root/spark
> spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib
> {code}
> NOTE: Spark is installed in /home/test/workshop/spark, and all 
> dependencies are installed.
> After submitting SparkPi to the dispatcher, the driver job is started but fails. 
> The error message is:
> {code}
> I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
> I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
> b7e24114-7585-40bc-879b-6a1188cb65b6-S1
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> /bin/sh: 1: ./bin/spark-submit: not found
> {code}
> Does anyone know how to map/set the Spark home in docker for this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960116#comment-14960116
 ] 

Xusen Yin commented on SPARK-11136:
---

Sure. And I will add more subtasks to this JIRA for other possible 
warm-start estimators.

> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). But first we need to add 
> warm-start for all ML estimators. This is an umbrella JIRA to add support for 
> the warm-start estimator. 
> Possible solutions:
> 1. Add warm-start fitting interface like def fit(dataset: DataFrame, 
> initModel: M, paramMap: ParamMap): M
> 2. Treat model as a special parameter, passing it through ParamMap. e.g. val 
> partialModel: Param[Option[M]] = new Param(...). In the case of model 
> existing, we use it to warm-start, else we start the training process from 
> the beginning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker

2015-10-15 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated SPARK-11143:
-
Description: 
I'm working on integration between Mesos & Spark. For now, I can start 
SlaveMesosDispatcher in a docker container, and I would like to also run the 
Spark executor in a Mesos docker container. I use the following configuration 
for it, but I get an error; any suggestion?

Configuration:

Spark: conf/spark-defaults.conf

{code}
spark.mesos.executor.docker.image        ubuntu
spark.mesos.executor.docker.volumes  
/usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home                /root/spark
#spark.executorEnv.SPARK_HOME            /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib
{code}

NOTE: Spark is installed in /home/test/workshop/spark, and all 
dependencies are installed.

After submitting SparkPi to the dispatcher, the driver job is started but fails. 
The error message is:
{code}
I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited 
without swap.
/bin/sh: 1: ./bin/spark-submit: not found
{code}
Does anyone know how to map/set the Spark home in docker for this case?

  was:
I'm working on integration between Mesos & Spark. For now, I can start 
SlaveMesosDispatcher in a docker; and I like to also run Spark executor in 
Mesos docker. I do the following configuration for it, but I got an error; any 
suggestion?

Configuration:

Spark: conf/spark-defaults.conf

spark.mesos.executor.docker.imageubuntu
spark.mesos.executor.docker.volumes  
/usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home/root/spark
#spark.executorEnv.SPARK_HOME /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib

NOTE: The spark are installed in /home/test/workshop/spark, and all 
dependencies are installed.

After submit SparkPi to the dispatcher, the driver job is started but failed. 
The error messes is:

I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited 
without swap.
/bin/sh: 1: ./bin/spark-submit: not found

Does any know how to map/set spark home in docker for this case?


> SparkMesosDispatcher can not launch driver in docker
> 
>
> Key: SPARK-11143
> URL: https://issues.apache.org/jira/browse/SPARK-11143
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
> Environment: Ubuntu 14.04
>Reporter: Klaus Ma
>
> I'm working on integration between Mesos & Spark. For now, I can start 
> SlaveMesosDispatcher in a docker container, and I would like to also run the 
> Spark executor in a Mesos docker container. I use the following configuration 
> for it, but I get an error; any suggestion?
> Configuration:
> Spark: conf/spark-defaults.conf
> {code}
> spark.mesos.executor.docker.image        ubuntu
> spark.mesos.executor.docker.volumes  
> /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
> spark.mesos.executor.home                /root/spark
> #spark.executorEnv.SPARK_HOME            /root/spark
> spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib
> {code}
> NOTE: Spark is installed in /home/test/workshop/spark, and all 
> dependencies are installed.
> After submitting SparkPi to the dispatcher, the driver job is started but fails. 
> The error message is:
> {code}
> I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
> I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
> b7e24114-7585-40bc-879b-6a1188cb65b6-S1
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> /bin/sh: 1: ./bin/spark-submit: not found
> {code}
> Does anyone know how to map/set the Spark home in docker for this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker

2015-10-15 Thread Klaus Ma (JIRA)
Klaus Ma created SPARK-11143:


 Summary: SparkMesosDispatcher can not launch driver in docker
 Key: SPARK-11143
 URL: https://issues.apache.org/jira/browse/SPARK-11143
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.5.1
 Environment: Ubuntu 14.04
Reporter: Klaus Ma


I'm working on integration between Mesos & Spark. For now, I can start 
SlaveMesosDispatcher in a docker container, and I would like to also run the 
Spark executor in a Mesos docker container. I use the following configuration 
for it, but I get an error; any suggestion?

Configuration:

Spark: conf/spark-defaults.conf

spark.mesos.executor.docker.image        ubuntu
spark.mesos.executor.docker.volumes  
/usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home                /root/spark
#spark.executorEnv.SPARK_HOME            /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib

NOTE: Spark is installed in /home/test/workshop/spark, and all 
dependencies are installed.

After submitting SparkPi to the dispatcher, the driver job is started but fails. 
The error message is:

I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited 
without swap.
/bin/sh: 1: ./bin/spark-submit: not found

Does anyone know how to map/set the Spark home in docker for this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version

2015-10-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11127:
--
Assignee: Tathagata Das

> Upgrade Kinesis Client Library to the latest stable version
> ---
>
> Key: SPARK-11127
> URL: https://issues.apache.org/jira/browse/SPARK-11127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Tathagata Das
>
> We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with 
> Kinesis Producer Library (KPL) and supports auto de-aggregation. It would be 
> great to upgrade KCL to the latest stable version.
> Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with 
> dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See 
> https://github.com/awslabs/amazon-kinesis-client#release-notes.
> [~tdas] [~brkyvz] Please recommend a version for upgrade.
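
For reference, a sketch of the version bump in sbt form, assuming the upgrade 
target settles on 1.6.1 as noted above (the actual build change may look different):

{code}
// hypothetical bump of the kinesis-asl module's KCL dependency
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.6.1"
{code}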



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9695) Add random seed Param to ML Pipeline

2015-10-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960080#comment-14960080
 ] 

Joseph K. Bradley commented on SPARK-9695:
--

{quote}I think we should store the whole pipeline and each stage's seed to 
reproduce the same results.{quote}
--> This will be possible for PipelineModel (with a fixed set of stages), but 
can we do it for Pipeline (with a mutable set of stages)?  We might have to 
have a weaker set of guarantees for Pipeline than PipelineModel.

It would be great if you could send a patch. Thanks!

> Add random seed Param to ML Pipeline
> 
>
> Key: SPARK-9695
> URL: https://issues.apache.org/jira/browse/SPARK-9695
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Note this will require some discussion about whether to make HasSeed the main 
> API for whether an algorithm takes a seed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator

2015-10-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960073#comment-14960073
 ] 

Joseph K. Bradley commented on SPARK-11136:
---

We should definitely have it be a Param.  I just commented on the KMeans JIRA 
about that.  Thanks for pointing out that issue.  Would you mind updating this 
JIRA's description to specify that as the chosen option?

> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). But first we need to add 
> warm-start for all ML estimators. This is an umbrella JIRA to add support for 
> the warm-start estimator. 
> Possible solutions:
> 1. Add warm-start fitting interface like def fit(dataset: DataFrame, 
> initModel: M, paramMap: ParamMap): M
> 2. Treat model as a special parameter, passing it through ParamMap. e.g. val 
> partialModel: Param[Option[M]] = new Param(...). In the case of model 
> existing, we use it to warm-start, else we start the training process from 
> the beginning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-10-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960072#comment-14960072
 ] 

Joseph K. Bradley commented on SPARK-10780:
---

[~jayants]  I agree with [~yinxusen]: The initialModel should be a Param and 
follow the example of other Params.  Could you please update your PR 
accordingly?  Thanks!

> Set initialModel in KMeans in Pipelines API
> ---
>
> Key: SPARK-10780
> URL: https://issues.apache.org/jira/browse/SPARK-10780
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is for the Scala version.  After this is merged, create a JIRA for 
> Python version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960063#comment-14960063
 ] 

Xusen Yin commented on SPARK-11136:
---

I have already linked all related issues. [~josephkb] Which method of 
supporting warm-start do you prefer? Or do you have other feasible suggestions? In 
[~jayants]'s KMeans warm-start code we can see the 3rd implementation.

> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). But first we need to add 
> warm-start for all ML estimators. This is an umbrella JIRA to add support for 
> the warm-start estimator. 
> Possible solutions:
> 1. Add warm-start fitting interface like def fit(dataset: DataFrame, 
> initModel: M, paramMap: ParamMap): M
> 2. Treat model as a special parameter, passing it through ParamMap. e.g. val 
> partialModel: Param[Option[M]] = new Param(...). In the case of model 
> existing, we use it to warm-start, else we start the training process from 
> the beginning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-10-15 Thread Jia Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960052#comment-14960052
 ] 

Jia Li commented on SPARK-5472:
---

[~tmyklebu] Does your PR handle BINARY type? 

Thanks,

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11142) org.datanucleus is already registered

2015-10-15 Thread raicho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

raicho updated SPARK-11142:
---
Priority: Minor  (was: Major)

> org.datanucleus is already registered
> -
>
> Key: SPARK-11142
> URL: https://issues.apache.org/jira/browse/SPARK-11142
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Windows7 Home Basic
>Reporter: raicho
>Priority: Minor
> Fix For: 1.5.1
>
>
> I first set up Spark this Wednesday on my computer. When I executed 
> spark-shell.cmd, warnings showed on the screen, like "org.datanucleus is already 
> registered. Ensure you don't have multiple JAR versions of the same plugin in 
> the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is 
> already registered and you are trying to register an identical plugin located 
> at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar"  " and 
> "org.datanucleus.api.jdo is already registered. Ensure you don't have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you 
> are trying to register an identical plugin located at URL 
> "file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" "
> The two URLs shown in fact point to the same path. I tried to find the classpath 
> in the configuration files but failed. No other code has been 
> executed on Spark yet.
> What happened, and how should I deal with the warning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960041#comment-14960041
 ] 

Xusen Yin commented on SPARK-10780:
---

This belongs to SPARK-11136.

But we need to pay more attention to a unified implementation, since other 
estimators will add warm-start too.

> Set initialModel in KMeans in Pipelines API
> ---
>
> Key: SPARK-10780
> URL: https://issues.apache.org/jira/browse/SPARK-10780
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is for the Scala version.  After this is merged, create a JIRA for 
> Python version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9695) Add random seed Param to ML Pipeline

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960034#comment-14960034
 ] 

Yanbo Liang commented on SPARK-9695:


I agree that if users set a pipeline stage's seed, it has higher priority than the 
pipeline's seed.
As for pipeline save and load, I think we should store the whole pipeline's seed 
and each stage's seed to reproduce the same results. This should be 
considered in the pipeline and stage save/load related tasks.
I think the assumption that the random number generator should not change behavior 
across Spark versions is reasonable.
I will try to submit an initial patch for this issue and I'm looking forward to your 
comments.
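
A tiny sketch of the priority rule being discussed (illustrative helper, not an 
existing API):

{code}
import org.apache.spark.ml.param.{LongParam, Params}

// an explicitly set per-stage seed wins; otherwise fall back to the pipeline's seed
def effectiveSeed(stage: Params, stageSeed: LongParam, pipelineSeed: Long): Long =
  stage.get(stageSeed).getOrElse(pipelineSeed)
{code}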

> Add random seed Param to ML Pipeline
> 
>
> Key: SPARK-9695
> URL: https://issues.apache.org/jira/browse/SPARK-9695
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Note this will require some discussion about whether to make HasSeed the main 
> API for whether an algorithm takes a seed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11142) org.datanucleus is already registered

2015-10-15 Thread raicho (JIRA)
raicho created SPARK-11142:
--

 Summary: org.datanucleus is already registered
 Key: SPARK-11142
 URL: https://issues.apache.org/jira/browse/SPARK-11142
 Project: Spark
  Issue Type: Question
  Components: Spark Shell
Affects Versions: 1.5.1
 Environment: Windows7 Home Basic
Reporter: raicho
 Fix For: 1.5.1


I first set up Spark this Wednesday on my computer. When I executed 
spark-shell.cmd, warnings showed on the screen, like "org.datanucleus is already 
registered. Ensure you don't have multiple JAR versions of the same plugin in 
the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is 
already registered and you are trying to register an identical plugin located 
at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar"  " and 
"org.datanucleus.api.jdo is already registered. Ensure you don't have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you 
are trying to register an identical plugin located at URL 
"file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" "

The two URLs shown in fact point to the same path. I tried to find the classpath in 
the configuration files but failed. No other code has been executed on 
Spark yet.
What happened, and how should I deal with the warning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978
 ] 

Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:43 AM:
--

[~josephkb] I don't think RFormula is the best way to resolve this issue, 
because it still uses a pipeline of chained transformers to encode multiple 
columns one by one, which performs poorly.
I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to 
reimplement all transformers to support a multi-value implementation, because 
some feature transformers don't need it.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify the current Params; we add new 
ones for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support a multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should reuse the single-value transform function as much as possible to implement 
the multi-value one, but how much code can be shared depends on the transformer.

So can I first start the sub-tasks with StringIndexer and OneHotEncoder, 
which are the most commonly used?
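
A minimal sketch of the shared multi-column Params this design implies 
(illustrative; the existing single-column inputCol/outputCol Params stay 
untouched for backwards compatibility):

{code}
import org.apache.spark.ml.param.{Params, StringArrayParam}

// hypothetical shared traits for the multi-column variants
trait HasInputCols extends Params {
  final val inputCols: StringArrayParam =
    new StringArrayParam(this, "inputCols", "input column names")
  final def getInputCols: Array[String] = $(inputCols)
}

trait HasOutputCols extends Params {
  final val outputCols: StringArrayParam =
    new StringArrayParam(this, "outputCols", "output column names")
  final def getOutputCols: Array[String] = $(outputCols)
}
{code}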
 


was (Author: yanboliang):
[~josephkb] I don't think RFormula is the best way to resolve this issue 
because it still use the pipeline chained transformers one by one to encode 
multiple columns which is low performance.
I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to 
reimplement all transformers to support a multi-value implementation because of 
some feature transformers not needed.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify current Params, we add a new 
one for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should maximum reuse the transform function of a single-value to implement the 
multi-value one, but it can not completely shared code depends on different 
transformers.

I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder 
which is mostly common used.
 

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978
 ] 

Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:26 AM:
--

[~josephkb] I don't think RFormula is the best way to resolve this issue, 
because it still uses a pipeline of chained transformers to encode multiple 
columns one by one, which performs poorly.
I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to 
reimplement all transformers to support a multi-value implementation, because 
some feature transformers don't need it.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify the current Params; we add new 
ones for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support a multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should reuse the single-value transform function as much as possible to implement 
the multi-value one, but how much code can be shared depends on the transformer.

I will first start the sub-tasks with StringIndexer and OneHotEncoder, 
which are the most commonly used.
 


was (Author: yanboliang):
[~josephkb] I don't think RFormula is the best way to resolve this issue 
because it still use the pipeline chained transformers one by one to encode 
multiple columns which is low performance.
I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to 
reimplement all transformers to support a multi-value implementation because of 
some feature transformers not needed.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify current Params, we add a new 
one for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should maximum reuse the transform function of a single-value to implement the 
multi-value one. 

I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder 
which is mostly common used.
 

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978
 ] 

Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:23 AM:
--

[~josephkb] I don't think RFormula is the best way to resolve this issue, 
because it still uses a pipeline of chained transformers to encode multiple 
columns one by one, which performs poorly.
I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to 
reimplement all transformers to support a multi-value implementation, because 
some feature transformers don't need it.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify the current Params; we add new 
ones for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support a multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should reuse the single-value transform function as much as possible to implement 
the multi-value one.
I will first start the sub-tasks with StringIndexer and OneHotEncoder, 
which are the most commonly used.
 


was (Author: yanboliang):
[~josephkb] I don't think RFormula is the best way to resolve this issue 
because it still use the pipeline chained transformers one by one to encode 
multiple columns which is low performance.
I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to 
reimplement all transformers to support a multi-value implementation because of 
some feature transformers not needed.
I will firstly try to start with OneHotEncoder which is mostly common used.
 

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978
 ] 

Yanbo Liang edited comment on SPARK-8418 at 10/16/15 1:23 AM:
--

[~josephkb] I don't think RFormula is the best way to resolve this issue, 
because it still uses a pipeline of chained transformers to encode multiple 
columns one by one, which performs poorly.
I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to 
reimplement all transformers to support a multi-value implementation, because 
some feature transformers don't need it.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify the current Params; we add new 
ones for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support a multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should reuse the single-value transform function as much as possible to implement 
the multi-value one.

I will first start the sub-tasks with StringIndexer and OneHotEncoder, 
which are the most commonly used.
 


was (Author: yanboliang):
[~josephkb] I don't think RFormula is the best way to resolve this issue 
because it still use the pipeline chained transformers one by one to encode 
multiple columns which is low performance.
I vote for strategy 2 of [~nburoojy] proposed. But I think we don't need to 
reimplement all transformers to support a multi-value implementation because of 
some feature transformers not needed.
Brief design doc:
* How input and output columns will be specified
/** @group setParam */
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
/** @group setParam */
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
* Schema validation
Make transformSchema adaptive to multiple input and output columns.
* Code sharing to reduce duplication
For backwards compatibility, we must not modify current Params, we add a new 
one for multiple inputs (and check for conflicting settings when running). 
Reimplement transformers to support multi-value implementation and make the 
single-value interface a trivial invocation of the multi-value code. I think we 
should maximum reuse the transform function of a single-value to implement the 
multi-value one. 
I will firstly try to start sub-tasks with StringIndexer and OneHotEncoder 
which is mostly common used.
 

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978
 ] 

Yanbo Liang commented on SPARK-8418:


[~josephkb] I don't think RFormula is the best way to resolve this issue, 
because it still uses a pipeline of chained transformers to encode multiple 
columns one by one, which performs poorly.
I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to 
reimplement all transformers to support a multi-value implementation, because 
some feature transformers don't need it.
I will first start with OneHotEncoder, which is the most commonly used.
 

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot

2015-10-15 Thread Donam Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959960#comment-14959960
 ] 

Donam Kim commented on SPARK-10066:
---

I have the same problem with Spark 1.5.1 on HDP 2.3.1.

> Can't create HiveContext with spark-shell or spark-sql on snapshot
> --
>
> Key: SPARK-10066
> URL: https://issues.apache.org/jira/browse/SPARK-10066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.0
> Environment: Centos 6.6
>Reporter: Robert Beauchemin
>Priority: Minor
>
> Built the 1.5.0-preview-20150812 with the following:
> ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Psparkr -DskipTests
> Starting spark-shell or spark-sql returns the following error: 
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rwx--
> at 
> org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
>  [elided]
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)   
> 
> It's trying to create a new HiveContext. Running pySpark or sparkR works and 
> creates a HiveContext successfully. SqlContext can be created successfully 
> with any shell.
> I've tried changing permissions on that HDFS directory (even as far as making 
> it world-writable) without success. Tried changing SPARK_USER and also 
> running spark-shell as different users without success.
> This works successfully on the same machine on 1.4.1 and on earlier 
> pre-release versions of Spark 1.5.0 (same make-distribution params). Just 
> trying the snapshot... 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11109:


Assignee: (was: Apache Spark)

> move FsHistoryProvider off import 
> org.apache.hadoop.fs.permission.AccessControlException
> 
>
> Key: SPARK-11109
> URL: https://issues.apache.org/jira/browse/SPARK-11109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{FsHistoryProvider}} imports and uses 
> {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
> superseded by its subclass 
> {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
> that subclass would remove a deprecation warning and ensure that, were the 
> Hadoop team to remove the old class (as HADOOP-11356 has already done on 
> trunk), everything still compiles and links.
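A minimal sketch of the intended change; the helper below is illustrative only and not the actual FsHistoryProvider code:

{code}
// Before (deprecated location):
// import org.apache.hadoop.fs.permission.AccessControlException
// After (superseding subclass):
import org.apache.hadoop.security.AccessControlException
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: skip entries the history server is not allowed to read.
def readableLength(fs: FileSystem, path: Path): Option[Long] =
  try {
    Some(fs.getFileStatus(path).getLen)
  } catch {
    case _: AccessControlException => None
  }
{code}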



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959958#comment-14959958
 ] 

Apache Spark commented on SPARK-11109:
--

User 'gweidner' has created a pull request for this issue:
https://github.com/apache/spark/pull/9144

> move FsHistoryProvider off import 
> org.apache.hadoop.fs.permission.AccessControlException
> 
>
> Key: SPARK-11109
> URL: https://issues.apache.org/jira/browse/SPARK-11109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{FsHistoryProvider}} imports and uses 
> {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
> superseded by its subclass 
> {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
> that subclass would remove a deprecation warning and ensure that, were the 
> Hadoop team to remove the old class (as HADOOP-11356 has already done on 
> trunk), everything still compiles and links.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11109:


Assignee: Apache Spark

> move FsHistoryProvider off import 
> org.apache.hadoop.fs.permission.AccessControlException
> 
>
> Key: SPARK-11109
> URL: https://issues.apache.org/jira/browse/SPARK-11109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{FsHistoryProvider}} imports and uses 
> {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
> superseded by its subclass 
> {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
> that subclass would remove a deprecation warning and ensure that, were the 
> Hadoop team to remove the old class (as HADOOP-11356 has already done on 
> trunk), everything still compiles and links.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10513) Springleaf Marketing Response

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959949#comment-14959949
 ] 

Yanbo Liang edited comment on SPARK-10513 at 10/16/15 12:45 AM:


[~josephkb] For 4: If a column of StringType contains the value "" (not null), 
StringIndexer will transform it correctly, but OneHotEncoder will throw an 
exception because "" cannot be used as a feature name. I think we should discuss 
whether it is legal for a categorical feature to contain the value ""; otherwise 
we should filter out such values or replace "" with another user-specified value.


was (Author: yanboliang):
[~josephkb] For 4: If a column of StringType has "" value (not null), 
StringIndexer will transform it right, but OneHotEncoder will throw exception 
caused of "" can not as a feature name. I think we should discuss that whether 
it is legal that one category feature contains "" value, otherwise we should 
filter these kinds of values or replaced "" with other user specified values?

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10513) Springleaf Marketing Response

2015-10-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959949#comment-14959949
 ] 

Yanbo Liang commented on SPARK-10513:
-

[~josephkb] For 4: If a column of StringType contains the value "" (not null), 
StringIndexer will transform it correctly, but OneHotEncoder will throw an 
exception because "" cannot be used as a feature name. I think we should discuss 
whether it is legal for a categorical feature to contain the value ""; otherwise 
we should filter out such values or replace "" with another user-specified value.
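A minimal sketch reproducing the reported behaviour, assuming an existing {{sqlContext}} (e.g. in spark-shell); the column names and sample values are made up:

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// A categorical column that contains the empty string as one of its values.
val df = sqlContext.createDataFrame(Seq(
  (0, "a"), (1, ""), (2, "b"))).toDF("id", "category")

val indexed = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIndex")
  .fit(df).transform(df)              // works: "" is indexed like any other value

val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex").setOutputCol("categoryVec")
  .transform(indexed)                 // reportedly fails: "" cannot be an attribute name
{code}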

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11135.
--
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9140
[https://github.com/apache/spark/pull/9140]

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.0, 1.5.2
>
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.
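In other words, an existing ordering satisfies a required ordering only when the required ordering is a prefix of it. A standalone sketch of such a check (the {{SortOrder}} case class below stands in for Catalyst's; this is not the actual Exchange code):

{code}
// Standalone illustration of the prefix check described above.
case class SortOrder(column: String, ascending: Boolean = true)

def orderingSatisfies(existing: Seq[SortOrder], required: Seq[SortOrder]): Boolean =
  required.length <= existing.length && existing.take(required.length) == required

val aAsc = SortOrder("a")
val bAsc = SortOrder("b")
orderingSatisfies(Seq(aAsc, bAsc), Seq(aAsc))   // true: no extra sort needed
orderingSatisfies(Seq(aAsc), Seq(aAsc, bAsc))   // false: Exchange must plan a sort
{code}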



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11141:


Assignee: (was: Apache Spark)

> Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
> --
>
> Key: SPARK-11141
> URL: https://issues.apache.org/jira/browse/SPARK-11141
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> When using S3 as a directory for WALs, the writes take too long. The driver 
> gets very easily bottlenecked when multiple receivers send AddBlock events to 
> the ReceiverTracker. This PR adds batching of events in the 
> ReceivedBlockTracker so that receivers don't get blocked by the driver for 
> too long.
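A minimal sketch of the batching idea, independent of the actual ReceivedBlockTracker and WriteAheadLog classes (all names below are illustrative): callers enqueue records and block on a promise, while a single writer thread drains the queue and issues one WAL write per batch.

{code}
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Illustrative batched writer; `writeBatch` stands in for the real WAL write.
class BatchedWriter(writeBatch: Seq[Array[Byte]] => Unit) {
  private case class Record(bytes: Array[Byte], done: Promise[Unit])
  private val queue = new LinkedBlockingQueue[Record]()

  private val writerThread = new Thread("wal-batch-writer") {
    override def run(): Unit = while (true) {
      val batch = ArrayBuffer(queue.take())                // wait for at least one record
      while (queue.peek() != null) batch += queue.poll()   // drain whatever else arrived
      writeBatch(batch.map(_.bytes))                       // one write for the whole batch
      batch.foreach(_.done.success(()))                    // unblock all callers
    }
  }
  writerThread.setDaemon(true)
  writerThread.start()

  /** Blocks until the record has been written as part of some batch. */
  def write(bytes: Array[Byte]): Unit = {
    val p = Promise[Unit]()
    queue.put(Record(bytes, p))
    Await.result(p.future, 30.seconds)
  }
}
{code}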



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959939#comment-14959939
 ] 

Apache Spark commented on SPARK-11141:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9143

> Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
> --
>
> Key: SPARK-11141
> URL: https://issues.apache.org/jira/browse/SPARK-11141
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> When using S3 as a directory for WALs, the writes take too long. The driver 
> gets very easily bottlenecked when multiple receivers send AddBlock events to 
> the ReceiverTracker. This PR adds batching of events in the 
> ReceivedBlockTracker so that receivers don't get blocked by the driver for 
> too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11141:


Assignee: Apache Spark

> Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
> --
>
> Key: SPARK-11141
> URL: https://issues.apache.org/jira/browse/SPARK-11141
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> When using S3 as a directory for WALs, the writes take too long. The driver 
> gets very easily bottlenecked when multiple receivers send AddBlock events to 
> the ReceiverTracker. This PR adds batching of events in the 
> ReceivedBlockTracker so that receivers don't get blocked by the driver for 
> too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2015-10-15 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-11141:
---

 Summary: Batching of ReceivedBlockTrackerLogEvents for efficient 
WAL writes
 Key: SPARK-11141
 URL: https://issues.apache.org/jira/browse/SPARK-11141
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Burak Yavuz


When using S3 as a directory for WALs, the writes take too long. The driver 
gets very easily bottlenecked when multiple receivers send AddBlock events to 
the ReceiverTracker. This PR adds batching of events in the 
ReceivedBlockTracker so that receivers don't get blocked by the driver for too 
long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11102:


Assignee: (was: Apache Spark)

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the 
> following exception is thrown; it is not informative. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}
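One possible improvement, sketched here with made-up helper names rather than the actual patch, is to resolve the user-supplied paths up front and fail with an explicit message instead of letting the Hadoop InputFormat fail later:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: resolve each user-supplied path early and complain clearly if any is missing.
def checkInputPaths(paths: Seq[String], hadoopConf: Configuration): Unit = {
  val missing = paths.filterNot { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(hadoopConf)
    fs.exists(path)
  }
  if (missing.nonEmpty) {
    throw new IllegalArgumentException(
      s"Input path does not exist: ${missing.mkString(", ")}")
  }
}
{code}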



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11102:


Assignee: Apache Spark

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the 
> following exception is thrown; it is not informative. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959897#comment-14959897
 ] 

Apache Spark commented on SPARK-11102:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9142

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the 
> following exception is thrown; it is not informative. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equals with Scala one

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10560:


Assignee: Apache Spark

> Make StreamingLogisticRegressionWithSGD Python API equals with Scala one
> 
>
> Key: SPARK-10560
> URL: https://issues.apache.org/jira/browse/SPARK-10560
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> The StreamingLogisticRegressionWithSGD Python API lacks some parameters 
> compared with the Scala one; here we make them equal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equals with Scala one

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10560:


Assignee: (was: Apache Spark)

> Make StreamingLogisticRegressionWithSGD Python API equals with Scala one
> 
>
> Key: SPARK-10560
> URL: https://issues.apache.org/jira/browse/SPARK-10560
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> The StreamingLogisticRegressionWithSGD Python API lacks some parameters 
> compared with the Scala one; here we make them equal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equals with Scala one

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959892#comment-14959892
 ] 

Apache Spark commented on SPARK-10560:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/9141

> Make StreamingLogisticRegressionWithSGD Python API equals with Scala one
> 
>
> Key: SPARK-10560
> URL: https://issues.apache.org/jira/browse/SPARK-10560
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> The StreamingLogisticRegressionWithSGD Python API lacks some parameters 
> compared with the Scala one; here we make them equal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-15 Thread Nick Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959876#comment-14959876
 ] 

Nick Pritchard commented on SPARK-11126:


Is there any workaround to avoid this memory leak?

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html
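Until this is fixed, one generic mitigation pattern (purely illustrative, not SQLListener's real code) is to cap how many per-stage entries a listener retains and evict the oldest ones beyond a limit:

{code}
import scala.collection.mutable

// Illustrative bounded retention for per-stage bookkeeping in a listener.
class BoundedStageMetrics[T](maxRetained: Int) {
  private val byStage = mutable.LinkedHashMap[Int, T]()    // insertion-ordered

  def update(stageId: Int, metrics: T): Unit = synchronized {
    byStage(stageId) = metrics
    while (byStage.size > maxRetained) {
      byStage.remove(byStage.head._1)                      // evict the oldest stage
    }
  }

  def get(stageId: Int): Option[T] = synchronized(byStage.get(stageId))
}
{code}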



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2629) Improve performance of DStream.updateStateByKey

2015-10-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2629:
-
Target Version/s: 1.6.0

> Improve performance of DStream.updateStateByKey
> ---
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2629) Improve performance of DStream.updateStateByKey

2015-10-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2629:
-
Affects Version/s: 0.9.2
   1.0.2
   1.2.2
   1.3.1
   1.4.1
   1.5.1

> Improve performance of DStream.updateStateByKey
> ---
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11140) Replace file server in driver with RPC-based alternative

2015-10-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11140:
--

 Summary: Replace file server in driver with RPC-based alternative
 Key: SPARK-11140
 URL: https://issues.apache.org/jira/browse/SPARK-11140
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin


As part of making encryption easy to configure in Spark, it would be better to 
use the existing RPC channel between the driver and executors to transfer files 
and jars added to the application.

This would remove the need to start the HTTP server currently used for that 
purpose, which has to be configured to use SSL if encryption is wanted. SSL is 
hard to configure correctly in a multi-user, distributed environment.
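A rough sketch of the idea with deliberately generic names, not Spark's actual RPC classes: the driver keeps a registry of added files and serves their bytes in chunks when an executor asks for them over the existing RPC connection.

{code}
import java.io.{File, RandomAccessFile}

// All names here are illustrative; Spark's real RPC layer differs.
case class FetchChunk(fileId: String, offset: Long, length: Int)
case class FileChunk(bytes: Array[Byte], lastChunk: Boolean)

class DriverFileRegistry {
  private val files = new java.util.concurrent.ConcurrentHashMap[String, File]()

  def addFile(id: String, file: File): Unit = files.put(id, file)

  /** Serve one chunk of a registered file; called from the driver's RPC endpoint. */
  def readChunk(req: FetchChunk): FileChunk = {
    val file = files.get(req.fileId)
    require(file != null, s"Unknown file: ${req.fileId}")
    val raf = new RandomAccessFile(file, "r")
    try {
      raf.seek(req.offset)
      val buf = new Array[Byte](math.min(req.length.toLong, file.length() - req.offset).toInt)
      raf.readFully(buf)
      FileChunk(buf, req.offset + buf.length >= file.length())
    } finally {
      raf.close()
    }
  }
}
{code}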



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11139) Make SparkContext.stop() exception-safe

2015-10-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-11139:


 Summary: Make SparkContext.stop() exception-safe
 Key: SPARK-11139
 URL: https://issues.apache.org/jira/browse/SPARK-11139
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Felix Cheung
Priority: Minor


In SparkContext.stop(), when an exception is thrown, the rest of the 
stop/cleanup actions are aborted.

Work has been done in SPARK-4194 to allow cleanup after partial initialization.

There is a similar issue in StreamingContext: SPARK-11137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11138) Flaky pyspark test: test_add_py_file

2015-10-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11138:
--

 Summary: Flaky pyspark test: test_add_py_file
 Key: SPARK-11138
 URL: https://issues.apache.org/jira/browse/SPARK-11138
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin


This test fails pretty often when running PR tests. For example:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console

{noformat}
==
ERROR: test_add_py_file (__main__.AddFileTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
line 396, in test_add_py_file
res = self.sc.parallelize(range(2)).map(func).first()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 
1315, in first
rs = self.take(1)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 
1297, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py", 
line 923, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
self.target_id, self.name)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 
7, localhost): org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
 line 111, in main
process()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
 line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py",
 line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 
1293, in takeUpToNumLeft
yield next(iterator)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
line 388, in func
from userlibrary import UserClass
ImportError: cannot import name UserClass

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1414)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:793)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1639)
at 
org.

[jira] [Created] (SPARK-11137) Make StreamingContext.stop() exception-safe

2015-10-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-11137:


 Summary: Make StreamingContext.stop() exception-safe
 Key: SPARK-11137
 URL: https://issues.apache.org/jira/browse/SPARK-11137
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.1
Reporter: Felix Cheung
Priority: Minor


In StreamingContext.stop(), when an exception is thrown, the rest of the 
stop/cleanup actions are aborted.

Discussed in https://github.com/apache/spark/pull/9116,
srowen commented
Hm, this is getting unwieldy. There are several nested try blocks here. The 
same argument goes for many of these methods -- if one fails should they not 
continue trying? A more tidy solution would be to execute a series of () -> 
Unit code blocks that perform some cleanup and make sure that they each fire in 
succession, regardless of the others. The final one to remove the shutdown hook 
could occur outside synchronization.

I realize we're expanding the scope of the change here, but is it maybe 
worthwhile to go all the way here?

Really, something similar could be done for SparkContext and there's an 
existing JIRA for it somewhere.

At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
the finally-related issue separately and all at once.
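A minimal sketch of the suggested pattern (not Spark's actual code): run each cleanup block in order and log-and-continue on failure, so that one failing step cannot abort the rest. The names in the usage comment are placeholders.

{code}
import scala.util.control.NonFatal

// Run every cleanup block; a failure in one does not skip the others.
def runAll(onError: Throwable => Unit)(blocks: (() => Unit)*): Unit = {
  blocks.foreach { block =>
    try block() catch { case NonFatal(e) => onError(e) }
  }
}

// Usage sketch inside stop():
// runAll(e => logWarning("Ignoring failure during stop()", e))(
//   () => stopReceivers(),
//   () => stopScheduler(),
//   () => removeShutdownHook())
{code}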




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-15 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Summary: Uninformative exception when specifying non-existent input for JSON 
data source  (was: Unreadable exception when specifying non-existent input for 
JSON data source)

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the 
> following exception is thrown; it is not informative. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11128) strange NPE when writing in non-existing S3 bucket

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11128:
--
Component/s: Input/Output

> strange NPE when writing in non-existing S3 bucket
> --
>
> Key: SPARK-11128
> URL: https://issues.apache.org/jira/browse/SPARK-11128
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.1
>Reporter: mathieu despriee
>Priority: Minor
>
> For the record, as it's relatively minor and related to s3n (not tested with 
> s3a): by mistake, we tried writing a Parquet DataFrame to a non-existent S3 
> bucket with a simple df.write.parquet(s3path).
> We got an NPE (see stack trace below), which is very misleading.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11124:
--
Component/s: Spark Core

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Navis
>Priority: Trivial
>
> Some json parsers are not closed. parser in JacksonParser#parseJson, for 
> example.
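The fix pattern is straightforward; here is a hedged sketch using Jackson directly, not the exact JacksonParser code:

{code}
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Sketch: always close the parser, even if parsing throws.
def firstToken(factory: JsonFactory, record: String): Option[JsonToken] = {
  val parser = factory.createParser(record)
  try {
    Option(parser.nextToken())
  } finally {
    parser.close()   // releases the underlying buffers/streams
  }
}
{code}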



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11123) Improve HistoryServer with multithread to replay logs

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11123.
---
Resolution: Duplicate

[~xietingwen] please search JIRAs before opening a new one.

> Improve HistoryServer with multithread to replay logs
> 
>
> Key: SPARK-11123
> URL: https://issues.apache.org/jira/browse/SPARK-11123
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xie Tingwen
>
> Now, with Spark 1.4, when I restart the HistoryServer it takes over 30 hours 
> to replay over 40,000 log files. What's more, once it has started, replaying a 
> single log may take half an hour and block other logs from being replayed. How 
> about rewriting it with multiple threads to speed up log replay?
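A rough sketch of the idea with generic names, not the actual FsHistoryProvider code: replay several event logs in parallel on a fixed-size thread pool instead of one at a time.

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative only: `replayOne` stands in for replaying a single event log.
def replayAll(logPaths: Seq[String], numThreads: Int)(replayOne: String => Unit): Unit = {
  val pool = Executors.newFixedThreadPool(numThreads)
  try {
    logPaths.foreach { path =>
      pool.submit(new Runnable {
        override def run(): Unit =
          try replayOne(path)
          catch { case e: Exception => println(s"Failed to replay $path: $e") }
      })
    }
  } finally {
    pool.shutdown()
    pool.awaitTermination(Long.MaxValue, TimeUnit.SECONDS)
  }
}
{code}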



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11068) Add callback to query execution

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11068:
--
Assignee: Wenchen Fan

> Add callback to query execution
> ---
>
> Key: SPARK-11068
> URL: https://issues.apache.org/jira/browse/SPARK-11068
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11076) Decimal Support for Ceil/Floor

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11076:
--
Assignee: Cheng Hao

> Decimal Support for Ceil/Floor
> --
>
> Key: SPARK-11076
> URL: https://issues.apache.org/jira/browse/SPARK-11076
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> Currently, Ceil & Floor don't support decimal, but Hive does.
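For illustration, this is the kind of query in question, assuming an existing {{sqlContext}}; Hive accepts it, and Spark SQL should too once decimal is supported:

{code}
// Assuming an existing sqlContext (e.g. in spark-shell).
sqlContext.sql(
  "SELECT ceil(cast(3.1415 as decimal(10, 4))), floor(cast(-3.1415 as decimal(10, 4)))"
).show()
{code}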



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11032) Failure to resolve having correctly

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11032:
--
Assignee: Wenchen Fan

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.6.0
>
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5391:
-
Assignee: Davies Liu

> SparkSQL fails to create tables with custom JSON SerDe
> --
>
> Key: SPARK-5391
> URL: https://issues.apache.org/jira/browse/SPARK-5391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: David Ross
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> - Using Spark built from trunk on this commit: 
> https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd
> - Build for Hive13
> - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde
> First download jar locally:
> {code}
> $ curl 
> http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar
>  > /tmp/json-serde-1.3-jar-with-dependencies.jar
> {code}
> Then add it in SparkSQL session:
> {code}
> add jar /tmp/json-serde-1.3-jar-with-dependencies.jar
> {code}
> Finally create table:
> {code}
> create table test_json (c1 boolean) ROW FORMAT SERDE 
> 'org.openx.data.jsonserde.JsonSerDe';
> {code}
> Logs for add jar:
> {code}
> 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar'
> 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this 
> point. hive.execution.engine=mr.
> 15/01/23 23:48:34 INFO SessionState: Added 
> /tmp/json-serde-1.3-jar-with-dependencies.jar to class path
> 15/01/23 23:48:34 INFO SessionState: Added resource: 
> /tmp/json-serde-1.3-jar-with-dependencies.jar
> 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR 
> /tmp/json-serde-1.3-jar-with-dependencies.jar at 
> http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with 
> timestamp 1422056914776
> 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
> Schema: List()
> 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
> Schema: List()
> {code}
> Logs (with error) for create table:
> {code}
> 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'create table test_json (c1 boolean) ROW FORMAT SERDE 
> 'org.openx.data.jsonserde.JsonSerDe''
> 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table 
> test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
> 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this 
> point. hive.execution.engine=mr.
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating 
> a lock manager
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table 
> test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941103 end=1422056941104 duration=1 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json 
> position=13
> 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941104 end=1422056941240 duration=136 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: 
> Schema(fieldSchemas:null, properties:null)
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941071 end=1422056941252 duration=181 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json 
> (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> 15/01/23 23:49:01 INFO log.PerfLogger:  start=1422056941067 end=1422056941258 duration=191 
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/23 23:49:01 WARN security.ShellBasedUnixGroupsMapping: got exception 
> trying to get groups for user anonymous
> org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
>

[jira] [Updated] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10829:
--
Assignee: Cheng Hao

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Critical
> Fix For: 1.6.0
>
>
> To reproduce, run the following code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
> val path = s"${dir.getCanonicalPath}/part=1"
> (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
> // If the "part = 1" filter gets pushed down, this query will throw 
> an exception since
> // "part" is not a valid column in the actual Parquet file
> checkAnswer(
>   sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 
> 1)"),
>   (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> We expect the result to be:
> {code}
> 2, 1
> 3, 1
> {code}
> But we got:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}
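
A rough way to see the required fix: only conjuncts whose referenced columns all exist physically in the Parquet files can be pushed down, while any conjunct that mentions the partition column (here {{part}}) has to stay in Spark's own Filter and be evaluated after partition pruning. A toy sketch of that split (plain Scala, not the actual DataSourceStrategy code; {{splitConjuncts}} and the set-of-column-names modelling are invented for illustration):
{code}
// Model each conjunct only by the set of column names it references.
// Conjuncts touching a partition column must not be pushed to Parquet.
def splitConjuncts(
    conjuncts: Seq[Set[String]],
    partitionCols: Set[String]): (Seq[Set[String]], Seq[Set[String]]) =
  conjuncts.partition(refs => refs.intersect(partitionCols).isEmpty)

// "a > 0 and (part = 0 or a > 1)" has two conjuncts referencing {a} and {part, a}.
val (pushable, keptInSpark) =
  splitConjuncts(Seq(Set("a"), Set("part", "a")), Set("part"))
// pushable    == Seq(Set("a"))         -> "a > 0" is safe to push down
// keptInSpark == Seq(Set("part", "a")) -> "(part = 0 or a > 1)" stays in Spark
{code}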



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11135:


Assignee: Josh Rosen  (was: Apache Spark)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959735#comment-14959735
 ] 

Apache Spark commented on SPARK-11135:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9140

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11135:


Assignee: Apache Spark  (was: Josh Rosen)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11136) Warm-start support for ML estimator

2015-10-15 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-11136:
-

 Summary: Warm-start support for ML estimator
 Key: SPARK-11136
 URL: https://issues.apache.org/jira/browse/SPARK-11136
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin
Priority: Minor


The current implementation of Estimator does not support warm-start fitting, 
i.e. estimator.fit(data, params, partialModel). We first need to add warm-start 
support to the individual ML estimators; this is the umbrella JIRA to track 
that work.

Possible solutions:

1. Add a warm-start fitting interface such as def fit(dataset: DataFrame, 
initModel: M, paramMap: ParamMap): M

2. Treat the model as a special parameter passed through the ParamMap, e.g. val 
partialModel: Param[Option[M]] = new Param(...). If a model is supplied, we use 
it to warm-start; otherwise we start training from scratch (a rough sketch 
follows below).
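
A rough, purely illustrative sketch of option 2 (none of these names exist in Spark ML today; {{WarmStartEstimator}}, {{initialModel}}, {{fitFrom}} and {{fitFromScratch}} are made up to show the shape of the API):
{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.Param
import org.apache.spark.sql.DataFrame

// Hypothetical base class: the partial model travels through the ParamMap as an
// optional Param that defaults to None, i.e. the usual cold start.
abstract class WarmStartEstimator[M <: Model[M]] extends Estimator[M] {

  final val initialModel: Param[Option[M]] =
    new Param(this, "initialModel", "optional model used to warm-start fitting")
  setDefault(initialModel -> None)

  def setInitialModel(model: M): this.type = set(initialModel, Some(model))

  override def fit(dataset: DataFrame): M = $(initialModel) match {
    case Some(partial) => fitFrom(dataset, partial) // continue from the partial model
    case None          => fitFromScratch(dataset)   // train from scratch as today
  }

  protected def fitFrom(dataset: DataFrame, partial: M): M
  protected def fitFromScratch(dataset: DataFrame): M
}
{code}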





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10412) In SQL tab, show execution memory per physical operator

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10412.
---
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 1.6.0

> In SQL tab, show execution memory per physical operator
> ---
>
> Key: SPARK-10412
> URL: https://issues.apache.org/jira/browse/SPARK-10412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> We already display it per task / stage. It's really useful to also display it 
> per operator so the user can know which one caused all the memory to be 
> allocated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10515) When killing executor, the pending replacement executors will be lost

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10515.
---
  Resolution: Fixed
Assignee: KaiXinXIaoLei
   Fix Version/s: 1.6.0
  1.5.2
Target Version/s: 1.5.2, 1.6.0

> When killing executor, the pending replacement executors will be lost
> -
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>Assignee: KaiXinXIaoLei
> Fix For: 1.5.2, 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-11071.
---
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
> Fix For: 1.6.0
>
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11135:
---
Summary: Exchange sort-planning logic incorrectly avoid sorts when existing 
ordering is subset of required ordering  (was: Exchange sort-planning logic may 
incorrect avoid sorts)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is subset of required ordering
> --
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11135:
---
Summary: Exchange sort-planning logic incorrectly avoid sorts when existing 
ordering is non-empty subset of required ordering  (was: Exchange sort-planning 
logic incorrectly avoid sorts when existing ordering is subset of required 
ordering)

> Exchange sort-planning logic incorrectly avoid sorts when existing ordering 
> is non-empty subset of required ordering
> 
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts

2015-10-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11135:
--

 Summary: Exchange sort-planning logic may incorrect avoid sorts
 Key: SPARK-11135
 URL: https://issues.apache.org/jira/browse/SPARK-11135
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker


In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
where the data has already been sorted by a superset of the requested sorting 
columns. For instance, let's say that a query calls for an operator's input to 
be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, 
b.asc]`. In this case, we do not need to re-sort the input. The converse, 
however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` 
alone will not satisfy the ordering requirements, requiring an additional sort 
to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips 
sorting when the existing output ordering is a subset of the required ordering. 
This is simple to fix, however.
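
For illustration only, here is the intended check written out over plain strings (a toy sketch, not the actual Exchange patch): an existing sort order satisfies a required order exactly when the required order is a prefix of the existing one, never when it is merely a subset.
{code}
// Existing ordering [a.asc, b.asc] satisfies required [a.asc]; the converse fails.
def satisfiesOrdering(existing: Seq[String], required: Seq[String]): Boolean =
  required.length <= existing.length && existing.take(required.length) == required

satisfiesOrdering(Seq("a.asc", "b.asc"), Seq("a.asc"))  // true:  no extra sort needed
satisfiesOrdering(Seq("a.asc"), Seq("a.asc", "b.asc"))  // false: Exchange must plan a sort
{code}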



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11135:
---
Description: 
In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
where the data has already been sorted by a superset of the requested sorting 
columns. For instance, let's say that a query calls for an operator's input to 
be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, 
b.asc]`. In this case, we do not need to re-sort the input. The converse, 
however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` 
alone will not satisfy the ordering requirements, requiring an additional sort 
to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips 
sorting when the existing output ordering is a subset of the required ordering. 
This is simple to fix, however.

This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
affects 1.5.0+.

  was:
In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
where the data has already been sorted by a superset of the requested sorting 
columns. For instance, let's say that a query calls for an operator's input to 
be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, 
b.asc]`. In this case, we do not need to re-sort the input. The converse, 
however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` 
alone will not satisfy the ordering requirements, requiring an additional sort 
to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips 
sorting when the existing output ordering is a subset of the required ordering. 
This is simple to fix, however.


> Exchange sort-planning logic may incorrect avoid sorts
> --
>
> Key: SPARK-11135
> URL: https://issues.apache.org/jira/browse/SPARK-11135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases 
> where the data has already been sorted by a superset of the requested sorting 
> columns. For instance, let's say that a query calls for an operator's input 
> to be sorted by `a.asc` and the input happens to already be sorted by 
> `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The 
> converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then 
> `a.asc` alone will not satisfy the ordering requirements, requiring an 
> additional sort to be planned by Exchange.
> However, the current Exchange code gets this wrong and incorrectly skips 
> sorting when the existing output ordering is a subset of the required 
> ordering. This is simple to fix, however.
> This bug was introduced in https://github.com/apache/spark/pull/7458, so it 
> affects 1.5.0+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11071:
--
Summary: Flaky test: o.a.s.launcher.LauncherServerSuite  (was: 
LauncherServerSuite::testTimeout is flaky)

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11071:
--
Labels: flaky-test  (was: )

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11071:
--
Component/s: (was: Spark Core)
 Tests

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11071
> URL: https://issues.apache.org/jira/browse/SPARK-11071
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>  Labels: flaky-test
>
> This test has failed a few times on jenkins, e.g.:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11134:
--
Labels: flaky-test  (was: )

> Flaky test: o.a.s.launcher.LauncherBackendSuite
> ---
>
> Key: SPARK-11134
> URL: https://issues.apache.org/jira/browse/SPARK-11134
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 110 times over 10.042591494 seconds. Last failure 
> message: The reference was null.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
>   at 
> org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite

2015-10-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11134:
-

 Summary: Flaky test: o.a.s.launcher.LauncherBackendSuite
 Key: SPARK-11134
 URL: https://issues.apache.org/jira/browse/SPARK-11134
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Andrew Or
Priority: Critical


{code}
sbt.ForkMain$ForkError: The code passed to eventually never returned normally. 
Attempted 110 times over 10.042591494 seconds. Last failure message: The 
reference was null.
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57)
at 
org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39)
at 
org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
at 
org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39)
{code}

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11133.

Resolution: Duplicate

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11133
> URL: https://issues.apache.org/jira/browse/SPARK-11133
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: Expected exception caused by connection timeout.
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11133:
-

 Summary: Flaky test: o.a.s.launcher.LauncherServerSuite
 Key: SPARK-11133
 URL: https://issues.apache.org/jira/browse/SPARK-11133
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Andrew Or
Priority: Critical


{code}
sbt.ForkMain$ForkError: Expected exception caused by connection timeout.
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
{code}

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite

2015-10-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11133:
--
Labels: flaky-test  (was: )

> Flaky test: o.a.s.launcher.LauncherServerSuite
> --
>
> Key: SPARK-11133
> URL: https://issues.apache.org/jira/browse/SPARK-11133
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: Expected exception caused by connection timeout.
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959604#comment-14959604
 ] 

Herman van Hovell commented on SPARK-9241:
--

It should grow linearly (or am I missing something?). For example, if we have 3 
grouping sets (as in the example), we would duplicate and project the data 3x. 
That is still bad, but comparable to the approach in [~yhuai]'s example (and it 
saves a join). We could have a problem with the {{GROUPING__ID}} bitmask field, 
though, since only 32/64 fields can be in a grouping set.
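
To make the linear-growth point concrete, here is a toy model of the expand-style rewrite (plain Scala collections, not Spark code; all names are invented): each input row is replicated once per DISTINCT column group, so 3 grouping sets mean 3x the rows, and the integer tag plays the role of the {{GROUPING__ID}} bitmask.
{code}
case class InRow(g: String, c1: Option[Int], c2: Option[Int], c3: Option[Int])

// One output row per grouping set, with the other distinct columns nulled out.
def expand(rows: Seq[InRow]): Seq[(Int, InRow)] = rows.flatMap { r =>
  Seq(
    1 -> r.copy(c2 = None, c3 = None), // grouping set for agg(DISTINCT c1)
    2 -> r.copy(c1 = None, c3 = None), // grouping set for agg(DISTINCT c2)
    4 -> r.copy(c1 = None, c2 = None)  // grouping set for agg(DISTINCT c3)
  )
}
// expand(input).size == 3 * input.size: growth is linear in the number of
// grouping sets, while the bitmask tag caps us at 32/64 distinct groups.
{code}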

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only support a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without change aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599
 ] 

Zhan Zhang edited comment on SPARK-11087 at 10/15/15 8:58 PM:
--

[~patcharee] I tried to duplicate your table as closely as possible, but still 
didn't hit the problem. Note that the query has to match some valid record in 
the partition; otherwise, partition pruning will trim all predicates before 
they ever reach the ORC scan. Please refer to the details below.

case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, 
w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, 
el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int)

val records = (1 to 100).map { i =>
record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt)
}


sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D")
sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D")
val test = sqlContext.read.format("orc").load("4D")
test.registerTempTable("4D")
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show

2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)
leaf-1 = (EQUALS y 117)
expr = (and leaf-0 leaf-1)
sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show
2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 321)
leaf-1 = (EQUALS y 118)
expr = (and leaf-0 leaf-1)



was (Author: zzhan):
[~patcharee] I try to duplicate your table as much as possible, but still 
didn't hit the problem. Please refer to the below for the details.

case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, 
w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, 
el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int)

val records = (1 to 100).map { i =>
record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt)
}


sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D")
sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D")
val test = sqlContext.read.format("orc").load("4D")
2503   test.registerTempTable("4D")
2504   sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2505  sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, 
z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z 
<= 8").show

2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)
leaf-1 = (EQUALS y 117)
expr = (and leaf-0 leaf-1)
2507   sqlContext.sql("select date, month, year, hh, u*0.9122461, 
u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and 
z >= 2 and z <= 8").show
2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 321)
leaf-1 = (EQUALS y 118)
expr = (and leaf-0 leaf-1)


> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected 

[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599
 ] 

Zhan Zhang commented on SPARK-11087:


[~patcharee] I tried to duplicate your table as closely as possible, but still 
didn't hit the problem. Please refer to the details below.

case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, 
w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, 
el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int)

val records = (1 to 100).map { i =>
record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, 
i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt)
}


sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D")
sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D")
val test = sqlContext.read.format("orc").load("4D")
test.registerTempTable("4D")
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show

2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)
leaf-1 = (EQUALS y 117)
expr = (and leaf-0 leaf-1)
sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show
2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = 
(EQUALS x 321)
leaf-1 = (EQUALS y 118)
expr = (and leaf-0 leaf-1)


> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hh  int 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int 

[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959592#comment-14959592
 ] 

Xusen Yin commented on SPARK-5874:
--

Sure I'll do it.

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959589#comment-14959589
 ] 

Joseph K. Bradley commented on SPARK-5874:
--

Sure, that sounds good.  Can you also please search for existing tickets and 
link them to the umbrella?

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2015-10-15 Thread Pratik Khadloya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571
 ] 

Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:39 PM:
--

I am seeing the same issue on Spark 1.4.1. I am trying to save a DataFrame as a 
table (saveAsTable) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

Also, I am not running in speculative mode:
.set("spark.speculation", "false")


was (Author: tispratik):
Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
table ( saveAsTable ) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issue

[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2015-10-15 Thread Pratik Khadloya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571
 ] 

Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:40 PM:
--

I am seeing the same issue on Spark 1.4.1. I am trying to save a DataFrame as a 
table (saveAsTable) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}


Also, I am not running in speculative mode:
{code}
.set("spark.speculation", "false")
{code}


was (Author: tispratik):
Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as 
table ( saveAsTable ) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}

Also, i am not running in speculative mode.
.set("spark.speculation", "false")

> FileNotFoundException on _temporary directory
> ---

[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2015-10-15 Thread Pratik Khadloya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571
 ] 

Pratik Khadloya commented on SPARK-2984:


I am seeing the same issue on Spark 1.4.1. I am trying to save a DataFrame as a 
table (saveAsTable) using SaveMode.Overwrite.

{code}
15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for 
[flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: 
[BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp}
15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet
 (inode 2376521862): File does not exist. Holder 
DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}
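
For context, a hedged sketch of the call path being described, writing a DataFrame out as a table with SaveMode.Overwrite; the input path and table name below are made-up placeholders, not from this report:

{code}
// Hedged sketch, assuming Spark 1.4 with Hive support; only the
// saveAsTable + SaveMode.Overwrite combination comes from the comment above.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object SaveAsTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-as-table-sketch"))
    val sqlContext = new HiveContext(sc)

    // Hypothetical input path; any DataFrame would do.
    val df = sqlContext.read.parquet("hdfs:///tmp/imps.parquet")

    // saveAsTable with Overwrite writes through a _temporary directory before
    // committing, which is where the lease error above is raised.
    df.write.mode(SaveMode.Overwrite).saveAsTable("agg_imps")

    sc.stop()
  }
}
{code}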

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist, so it throws an exception.
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.

[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959564#comment-14959564
 ] 

Xusen Yin commented on SPARK-5874:
--

I'd love to add support to individual models first. But since there are many 
estimators in the ML package now, I think we'd better add an umbrella JIRA to 
control the process. Can I create new JIRA subtasks under this JIRA?

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959534#comment-14959534
 ] 

Apache Spark commented on SPARK-6488:
-

User 'dusenberrymw' has created a pull request for this issue:
https://github.com/apache/spark/pull/9139

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.
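
As a reference for the Scala implementation this ticket proposes to reuse, a hedged sketch of BlockMatrix addition and multiplication in Scala; the matrix contents and block sizes below are illustrative assumptions:

{code}
// Hedged sketch of the existing Scala BlockMatrix API that the Python
// bindings are meant to delegate to; the block contents are made-up examples.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

object BlockMatrixSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("blockmatrix-sketch"))

    // Two 4x4 matrices stored as 2x2 blocks (only the diagonal blocks are set).
    val blocksA = sc.parallelize(Seq(
      ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
      ((1, 1), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
    val blocksB = sc.parallelize(Seq(
      ((0, 0), Matrices.dense(2, 2, Array(1.0, 1.0, 1.0, 1.0))),
      ((1, 1), Matrices.dense(2, 2, Array(2.0, 2.0, 2.0, 2.0)))))

    val a = new BlockMatrix(blocksA, 2, 2)
    val b = new BlockMatrix(blocksB, 2, 2)

    val sum     = a.add(b)        // element-wise addition
    val product = a.multiply(b)   // distributed matrix multiplication

    println(sum.toLocalMatrix())
    println(product.toLocalMatrix())
    sc.stop()
  }
}
{code}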



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6488:
---

Assignee: Mike Dusenberry  (was: Apache Spark)

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6488:
---

Assignee: Apache Spark  (was: Mike Dusenberry)

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5657) Add PySpark Avro Output Format example

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5657.
---
Resolution: Won't Fix

> Add PySpark Avro Output Format example
> --
>
> Key: SPARK-5657
> URL: https://issues.apache.org/jira/browse/SPARK-5657
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 1.2.0
>Reporter: Stanislav Los
>
> There is an Avro Input Format example that shows how to read Avro data in 
> PySpark, but nothing shows how to write from PySpark to Avro. The main 
> challenge is that a Converter needs an Avro schema to build a record, but the current 
> Spark API doesn't provide a way to supply extra parameters to custom 
> converters. A workaround is possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11039) Document all UI "retained*" configurations

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11039:
---
Assignee: Nick Pritchard

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Assignee: Nick Pritchard
>Priority: Trivial
> Fix For: 1.5.2, 1.6.0
>
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.
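
For reference, a minimal sketch showing how the two settings listed above can be set on a SparkConf; the app name and retention values are illustrative assumptions, only the configuration keys come from the ticket:

{code}
// A minimal sketch; lower retention values keep fewer UI entries in driver memory.
import org.apache.spark.{SparkConf, SparkContext}

object RetainedUiConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("retained-ui-conf-sketch")
      .set("spark.sql.ui.retainedExecutions", "50")      // cap retained SQL executions in the UI
      .set("spark.streaming.ui.retainedBatches", "100")  // cap retained streaming batches in the UI
    val sc = new SparkContext(conf)
    // ... job code would go here ...
    sc.stop()
  }
}
{code}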



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11039) Document all UI "retained*" configurations

2015-10-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11039.

   Resolution: Fixed
Fix Version/s: 1.5.2, 1.6.0

Issue resolved by pull request 9052
[https://github.com/apache/spark/pull/9052]

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Priority: Trivial
> Fix For: 1.6.0, 1.5.2
>
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType

2015-10-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959451#comment-14959451
 ] 

Michael Armbrust commented on SPARK-8658:
-

There is no query that exposes the problem, as it's an internal quirk. The 
{{equals}} method should check all of the specified fields for equality; today 
it is missing some.

> AttributeReference equals method only compare name, exprId and dataType
> ---
>
> Key: SPARK-8658
> URL: https://issues.apache.org/jira/browse/SPARK-8658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Antonio Jesus Navarro
>
> The AttributeReference "equals" method only treats objects as different when 
> they have a different name, expression id, or dataType. With this behavior, 
> when I tried to do a "transformExpressionsDown" and transform qualifiers inside 
> "AttributeReferences", these objects were not replaced, because the 
> transformer considers them equal.
> I propose to add these variables to the "equals" method:
> name, dataType, nullable, metadata, exprId, qualifiers
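
For illustration, a hedged sketch of an equals method that compares all of the fields proposed above. The class below is a simplified stand-in with hypothetical field types, not the actual Catalyst AttributeReference:

{code}
// A simplified stand-in (not Catalyst's AttributeReference) whose equals
// compares every field listed in the proposal above.
class AttrRef(
    val name: String,
    val dataType: String,
    val nullable: Boolean,
    val metadata: Map[String, String],
    val exprId: Long,
    val qualifiers: Seq[String]) {

  override def equals(other: Any): Boolean = other match {
    case that: AttrRef =>
      name == that.name &&
        dataType == that.dataType &&
        nullable == that.nullable &&
        metadata == that.metadata &&
        exprId == that.exprId &&
        qualifiers == that.qualifiers
    case _ => false
  }

  // hashCode must stay consistent with equals.
  override def hashCode(): Int =
    (name, dataType, nullable, metadata, exprId, qualifiers).hashCode()
}
{code}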



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959440#comment-14959440
 ] 

Reynold Xin commented on SPARK-9241:


Do we have any idea of the performance characteristics of this rewrite? IIUC, 
a grouping set's complexity grows exponentially with the number of items in the 
set?

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only supports a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without changing aggregate functions).
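
For concreteness, a hedged sketch of a query that aggregates over two different DISTINCT columns, the case this sub-task targets; the table and column names below are made up:

{code}
// Hedged sketch; "events", "user_id" and "session_id" are hypothetical names.
// A query like this uses two different DISTINCT columns, which the new
// aggregation path described above does not yet handle with a single plan.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MultiDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-distinct-sketch"))
    val sqlContext = new SQLContext(sc)

    sqlContext.read.parquet("hdfs:///tmp/events.parquet")  // hypothetical input
      .registerTempTable("events")

    sqlContext.sql(
      """SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id)
        |FROM events""".stripMargin).show()

    sc.stop()
  }
}
{code}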



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map

2015-10-15 Thread Karl D. Gierach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959298#comment-14959298
 ] 

Karl D. Gierach edited comment on SPARK-5739 at 10/15/15 7:06 PM:
--

Is there any way to increase this block limit? I'm hitting the same issue 
during a UnionRDD operation.

Also, above, this issue's state is "resolved", but I'm not sure what the 
resolution is. Maybe a state of "closed" with a reference to the duplicate 
ticket would make it clearer.



was (Author: kgierach):
Is there any way to increase this block limit? I'm hitting the same issue 
during a UnionRDD operation.

Also, above, this issue's state is "resolved", but I'm not sure what the 
resolution is?


> Size exceeds Integer.MAX_VALUE in File Map
> --
>
> Key: SPARK-5739
> URL: https://issues.apache.org/jira/browse/SPARK-5739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1
> Environment: Spark1.1.1 on a cluster with 12 node. Every node with 
> 128GB RAM, 24 Core. the data is just 40GB, and there is 48 parallel task on a 
> node.
>Reporter: DjvuLee
>Priority: Minor
>
> I just ran the kmeans algorithm using randomly generated data, but this 
> problem occurred after some iterations. I tried several times, and the problem 
> is reproducible. 
> Because the data is randomly generated, I guess there may be a bug. Or, if 
> random data can lead to a scenario where the size is bigger than 
> Integer.MAX_VALUE, can we check the size before using the file map?
> 2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.util.SizeEstimator - Failed to check whether 
> UseCompressedOops is set; assuming yes
> [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds 
> Integer.MAX_VALUE
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
>   at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
>   at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
>   at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
>   at KMeansDataGenerator$.main(kmeans.scala:105)
>   at KMeansDataGenerator.main(kmeans.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>   at java.lang.reflect.Method.invoke(Method.java:619)
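
Not an answer from this thread, but a hedged sketch of one commonly suggested mitigation: repartition the input into more (and therefore smaller) partitions so that no single cached block approaches the 2 GB limit behind this exception. The path, partition count, and KMeans parameters below are illustrative assumptions only:

{code}
// Hedged sketch of a mitigation, not advice from this ticket: more partitions
// mean smaller per-partition blocks when caching or training.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-repartition-sketch"))

    val data = sc.textFile("hdfs:///tmp/random-points.txt")        // hypothetical path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .repartition(1000)                                           // illustrative partition count
      .cache()

    val model = KMeans.train(data, 10, 20)                         // illustrative k and iterations
    println(s"Cluster centers: ${model.clusterCenters.mkString(", ")}")
    sc.stop()
  }
}
{code}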



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11131:


Assignee: (was: Apache Spark)

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.
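
To illustrate the workaround mentioned above, a hedged sketch of the difference between a fire-and-forget send and a request/response ask; the classes below are simplified stand-ins, not Spark's internal RPC types:

{code}
// Hedged sketch with stand-in types: send() gives no acknowledgement, so other
// Master messages can race ahead of the registration; ask() returns a reply
// the caller can wait on, narrowing the race described in the ticket.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

trait EndpointRef {
  def send(msg: Any): Unit        // fire-and-forget
  def ask(msg: Any): Future[Any]  // request/response
}

// Trivial in-memory stand-in for a Master endpoint.
object LocalMaster extends EndpointRef {
  override def send(msg: Any): Unit = ()                            // no acknowledgement
  override def ask(msg: Any): Future[Any] = Future.successful("RegisteredWorker")
}

object WorkerRegistrationSketch {
  def main(args: Array[String]): Unit = {
    // Fire-and-forget: the caller has no idea when (or whether) it was processed.
    LocalMaster.send("RegisterWorker(worker-1)")

    // Request/response: the caller blocks until the Master acknowledges.
    val ack = Await.result(LocalMaster.ask("RegisterWorker(worker-1)"), 30.seconds)
    println(s"Master replied: $ack")
  }
}
{code}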



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11131:


Assignee: Apache Spark

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11131) Worker registration protocol is racy

2015-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959378#comment-14959378
 ] 

Apache Spark commented on SPARK-11131:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9138

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


