[jira] [Commented] (SPARK-14284) Rename KMeansSummary.size to clusterSizes

2016-03-30 Thread Shally Sangal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219414#comment-15219414
 ] 

Shally Sangal commented on SPARK-14284:
---

I can take this up if no one has started on it.

> Rename KMeansSummary.size to clusterSizes
> -
>
> Key: SPARK-14284
> URL: https://issues.apache.org/jira/browse/SPARK-14284
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> In spark.ml KMeansSummary:
> We should deprecate the existing method {{size}} and create an alias called 
> {{clusterSizes}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14261) Memory leak in Spark Thrift Server

2016-03-30 Thread Xiaochun Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219304#comment-15219304
 ] 

Xiaochun Liang commented on SPARK-14261:


I did take a heap dump while the server was running. Unfortunately the heap dump 
file is too big to be uploaded, so I uploaded screenshots of the heap dump 
instead. 

Some explanations of the heap dump screenshots:
1. The heap dump files were analyzed with MemoryAnalyzer.
2. 16716 is the process id.
3. 64g means the dump was taken when the Java process reached 6.4g, 
while 80g means the dump was taken when the Java process reached 8.0g.

I looked through the memory in both cases and found the following:
1. java.io.DeleteOnExitHook holds a large amount of memory in both cases.
2. The memory held by org.apache.hadoop.hive.ql.processors.CommandProcessorFactory 
increases a lot when usage reaches 8g.
3. The memory held by org.apache.hadoop.conf.Configuration increases a lot when 
usage reaches 8g.


> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> MemorySnapshot.PNG
>
>
> I am running the Spark Thrift server on Windows Server 2012. The Spark Thrift 
> server is launched in YARN client mode. Its memory usage increases gradually 
> as queries come in. I suspect there is a memory leak in the Spark Thrift 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14153) My dataset does not provide proper predictions in ALS

2016-03-30 Thread Dulaj Rajitha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219298#comment-15219298
 ] 

Dulaj Rajitha commented on SPARK-14153:
---

Could you please suggest a solution? The training data-set I used might have 
some problem, but I cannot work out what it is.

> My dataset does not provide proper predictions in ALS
> -
>
> Key: SPARK-14153
> URL: https://issues.apache.org/jira/browse/SPARK-14153
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> When I used the data-set from the GitHub example, I got proper predictions. But 
> when I used my own data set it does not predict well (it has a large RMSE).
> I used a cross validator for ALS (in Spark ML), and here are the best model 
> parameters.
> 16/03/25 12:03:06 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN)
> 16/03/25 12:03:06 INFO CrossValidator: Best set of parameters:
> {
>   als_c911c0e183a3-alpha: 0.02,
>   als_c911c0e183a3-rank: 500,
>   als_c911c0e183a3-regParam: 0.03
> }
> But when I used the movie data set it gives proper values for the parameters, as below:
> 16/03/24 14:07:07 INFO CrossValidator: Average cross-validation metrics: 
> WrappedArray(1.9481584447713676, 2.0501457159728944, 2.0600857505406935, 
> 1.9457234533860048, 2.0494498583414282, 2.0595306613827002, 
> 1.9488322049918922, 2.0489573853226797, 2.0584252131752, 1.9464006741621391, 
> 2.048241271354197, 2.057853990227443)
> 16/03/24 14:07:07 INFO CrossValidator: Best set of parameters:
> {
>   als_31a605e7717b-alpha: 0.02,
>   als_31a605e7717b-rank: 1,
>   als_31a605e7717b-regParam: 0.02
> }
> 16/03/24 14:07:07 INFO CrossValidator: Best cross-validation metric: 
> 1.9457234533860048.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14287:


Assignee: (was: Apache Spark)

> Method to determine if Dataset is bounded or not
> 
>
> Key: SPARK-14287
> URL: https://issues.apache.org/jira/browse/SPARK-14287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Burak Yavuz
>
> With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
> become unbounded. With unbounded data, the execution of some methods and 
> operations will not make sense, e.g. Dataset.count().
> A simple API is required to check whether the data in a Dataset is bounded or 
> unbounded. This will allow users to check whether their Dataset is in 
> streaming mode or not. ML algorithms may check whether the data is unbounded 
> and throw an exception, for example.
> The implementation of this method is simple; however, naming it is the 
> challenge. Some possible names for this method are:
>  - isStreaming
>  - isContinuous
>  - isBounded
>  - isUnbounded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14287:


Assignee: Apache Spark

> Method to determine if Dataset is bounded or not
> 
>
> Key: SPARK-14287
> URL: https://issues.apache.org/jira/browse/SPARK-14287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
> become unbounded. With unbounded data, the execution of some methods and 
> operations will not make sense, e.g. Dataset.count().
> A simple API is required to check whether the data in a Dataset is bounded or 
> unbounded. This will allow users to check whether their Dataset is in 
> streaming mode or not. ML algorithms may check whether the data is unbounded 
> and throw an exception, for example.
> The implementation of this method is simple; however, naming it is the 
> challenge. Some possible names for this method are:
>  - isStreaming
>  - isContinuous
>  - isBounded
>  - isUnbounded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219297#comment-15219297
 ] 

Apache Spark commented on SPARK-14287:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/12080

> Method to determine if Dataset is bounded or not
> 
>
> Key: SPARK-14287
> URL: https://issues.apache.org/jira/browse/SPARK-14287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Burak Yavuz
>
> With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
> become unbounded. With unbounded data, the execution of some methods and 
> operations will not make sense, e.g. Dataset.count().
> A simple API is required to check whether the data in a Dataset is bounded or 
> unbounded. This will allow users to check whether their Dataset is in 
> streaming mode or not. ML algorithms may check whether the data is unbounded 
> and throw an exception, for example.
> The implementation of this method is simple; however, naming it is the 
> challenge. Some possible names for this method are:
>  - isStreaming
>  - isContinuous
>  - isBounded
>  - isUnbounded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-03-30 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-14287:

Summary: Method to determine if Dataset is bounded or not  (was: 
isStreaming method for Dataset)

> Method to determine if Dataset is bounded or not
> 
>
> Key: SPARK-14287
> URL: https://issues.apache.org/jira/browse/SPARK-14287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Burak Yavuz
>
> With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
> become unbounded. With unbounded data, the execution of some methods and 
> operations will not make sense, e.g. Dataset.count().
> A simple API is required to check whether the data in a Dataset is bounded or 
> unbounded. This will allow users to check whether their Dataset is in 
> streaming mode or not. ML algorithms may check whether the data is unbounded 
> and throw an exception, for example.
> The implementation of this method is simple; however, naming it is the 
> challenge. Some possible names for this method are:
>  - isStreaming
>  - isContinuous
>  - isBounded
>  - isUnbounded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server

2016-03-30 Thread Xiaochun Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochun Liang updated SPARK-14261:
---
Attachment: 16716_heapdump_80g.PNG
16716_heapdump_64g.PNG

Screenshots of heap dumps of the Spark Thrift server at 6.4g and 8.0g memory 
usage, respectively.

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> MemorySnapshot.PNG
>
>
> I am running the Spark Thrift server on Windows Server 2012. The Spark Thrift 
> server is launched in YARN client mode. Its memory usage increases gradually 
> as queries come in. I suspect there is a memory leak in the Spark Thrift 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14287) isStreaming method for Dataset

2016-03-30 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-14287:
---

 Summary: isStreaming method for Dataset
 Key: SPARK-14287
 URL: https://issues.apache.org/jira/browse/SPARK-14287
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Burak Yavuz


With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
become unbounded. With unbounded data, the execution of some methods and 
operations will not make sense, e.g. Dataset.count().

A simple API is required to check whether the data in a Dataset is bounded or 
unbounded. This will allow users to check whether their Dataset is in streaming 
mode or not. ML algorithms may check whether the data is unbounded and throw an 
exception, for example.

The implementation of this method is simple; however, naming it is the 
challenge. Some possible names for this method are:
 - isStreaming
 - isContinuous
 - isBounded
 - isUnbounded
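
A minimal usage sketch, assuming the name {{isStreaming}} is the one chosen (any 
of the other candidates would be used the same way); the helper below is 
illustrative and not part of any proposed patch:

{code}
import org.apache.spark.sql.Dataset

object BoundedCheck {
  // Returns None for an unbounded (streaming) Dataset, where count() would
  // never terminate, and the actual count for a bounded one.
  def safeCount[T](ds: Dataset[T]): Option[Long] =
    if (ds.isStreaming) None else Some(ds.count())
}
{code}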



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-30 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219289#comment-15219289
 ] 

Yong Tang commented on SPARK-14238:
---

Hi [~mlnick], I created a pull request:
https://github.com/apache/spark/pull/12079
Let me know if you find any issues or there is anything I need to change. 
Thanks.

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14238:


Assignee: Apache Spark

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14238:


Assignee: (was: Apache Spark)

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219288#comment-15219288
 ] 

Apache Spark commented on SPARK-14238:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12079

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219262#comment-15219262
 ] 

Apache Spark commented on SPARK-13902:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12060

> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more, and they are 
> referenced by the child stages, because the graph is not built in the correct 
> order.
> I added a sample RDD graph that shows the illegal stage graph to 
> {{DAGSchedulerSuite}} and then fixed it.
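
As a hedged illustration (not the actual graph added to {{DAGSchedulerSuite}}), a 
job of the following shape, where one shuffle dependency is reachable both 
directly and through another shuffle, is the kind of case that requires ancestor 
stages to be built in topological order:

{code}
// Assumes a SparkContext named sc, e.g. in spark-shell.
val rddA = sc.parallelize(1 to 10).map(i => (i % 3, i))
val rddB = rddA.groupByKey()                  // shuffle dependency 1
val rddC = rddB.mapValues(_.sum).groupByKey() // shuffle dependency 2, built on top of 1
val rddD = rddB.join(rddC)                    // reaches shuffle 1 both directly and via shuffle 2
rddD.count()
{code}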



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13112) CoarsedExecutorBackend register to driver should wait Executor was ready

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13112:


Assignee: (was: Apache Spark)

> CoarsedExecutorBackend register to driver should wait Executor was ready
> 
>
> Key: SPARK-13112
> URL: https://issues.apache.org/jira/browse/SPARK-13112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: SuYan
>
> Description:
> Because some hosts' disks are busy, the executor can fail with a 
> TimeoutException while trying to register with the shuffle server on that 
> host, and it then calls exit(1) when a task is launched on a null executor.
> If YARN cluster resources are also a little busy, YARN considers that host 
> idle and prefers to allocate executors on the same host again, so there is a 
> chance that one task fails 4 times on the same host.
> Currently, CoarsedExecutorBackend registers with the driver first, and only 
> initializes the Executor after registration with the driver succeeds.
> If an exception occurs during Executor initialization, the driver does not 
> know about that, still launches tasks on that executor, and the backend then 
> calls System.exit(1).
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>   ...
>   case LaunchTask(data) =>
>     if (executor == null) {
>       logError("Received LaunchTask command but executor was null")
>       System.exit(1)
> {code}
> It is more reasonable to register with the driver only after the Executor is 
> ready, and to make registerTimeout configurable.
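
A hedged sketch of the ordering the description proposes (heavily simplified; 
not the actual CoarsedExecutorBackend code, and the function names below are 
placeholders): build the Executor first, and only register with the driver once 
that construction has succeeded.

{code}
import scala.util.{Failure, Success, Try}

def startBackend(createExecutor: () => AnyRef, registerWithDriver: () => Unit): Unit = {
  Try(createExecutor()) match {
    case Success(_) => registerWithDriver() // driver only learns about a usable executor
    case Failure(e) =>
      // fail fast instead of letting the driver schedule tasks onto a null executor
      System.err.println(s"Executor initialization failed: $e")
      sys.exit(1)
  }
}
{code}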



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13112) CoarsedExecutorBackend register to driver should wait Executor was ready

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219237#comment-15219237
 ] 

Apache Spark commented on SPARK-13112:
--

User 'viper-kun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12078

> CoarsedExecutorBackend register to driver should wait Executor was ready
> 
>
> Key: SPARK-13112
> URL: https://issues.apache.org/jira/browse/SPARK-13112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: SuYan
>
> Description:
> Because some hosts' disks are busy, the executor can fail with a 
> TimeoutException while trying to register with the shuffle server on that 
> host, and it then calls exit(1) when a task is launched on a null executor.
> If YARN cluster resources are also a little busy, YARN considers that host 
> idle and prefers to allocate executors on the same host again, so there is a 
> chance that one task fails 4 times on the same host.
> Currently, CoarsedExecutorBackend registers with the driver first, and only 
> initializes the Executor after registration with the driver succeeds.
> If an exception occurs during Executor initialization, the driver does not 
> know about that, still launches tasks on that executor, and the backend then 
> calls System.exit(1).
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>   ...
>   case LaunchTask(data) =>
>     if (executor == null) {
>       logError("Received LaunchTask command but executor was null")
>       System.exit(1)
> {code}
> It is more reasonable to register with the driver only after the Executor is 
> ready, and to make registerTimeout configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13112) CoarsedExecutorBackend register to driver should wait Executor was ready

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13112:


Assignee: Apache Spark

> CoarsedExecutorBackend register to driver should wait Executor was ready
> 
>
> Key: SPARK-13112
> URL: https://issues.apache.org/jira/browse/SPARK-13112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: SuYan
>Assignee: Apache Spark
>
> Description:
> Because some hosts' disks are busy, the executor can fail with a 
> TimeoutException while trying to register with the shuffle server on that 
> host, and it then calls exit(1) when a task is launched on a null executor.
> If YARN cluster resources are also a little busy, YARN considers that host 
> idle and prefers to allocate executors on the same host again, so there is a 
> chance that one task fails 4 times on the same host.
> Currently, CoarsedExecutorBackend registers with the driver first, and only 
> initializes the Executor after registration with the driver succeeds.
> If an exception occurs during Executor initialization, the driver does not 
> know about that, still launches tasks on that executor, and the backend then 
> calls System.exit(1).
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>   ...
>   case LaunchTask(data) =>
>     if (executor == null) {
>       logError("Received LaunchTask command but executor was null")
>       System.exit(1)
> {code}
> It is more reasonable to register with the driver only after the Executor is 
> ready, and to make registerTimeout configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14286) Empty ORC table join throws exception

2016-03-30 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14286:


 Summary: Empty ORC table join throws exception
 Key: SPARK-14286
 URL: https://issues.apache.org/jira/browse/SPARK-14286
 Project: Spark
  Issue Type: Bug
Reporter: Rajesh Balamohan
Priority: Minor


When joining with an empty ORC table, Spark throws the following exception.

{noformat}
java.sql.SQLException: java.lang.IllegalArgumentException: orcFileOperator: 
path /apps/hive/warehouse/test.db/table does not have valid orc files 
matching the pattern
{noformat}
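
A hedged reproduction sketch (database and table names are placeholders, not 
taken from this report), assuming a HiveContext bound to {{sqlContext}}, e.g. in 
spark-shell built with Hive support:

{code}
// Create an ORC table but never write to it, so its directory has no ORC files.
sqlContext.sql("CREATE TABLE test.empty_orc (id INT) STORED AS ORC")

// Join anything against it.
val other = sqlContext.range(10).toDF("id")
other.registerTempTable("other")
sqlContext.sql(
  "SELECT * FROM other o JOIN test.empty_orc e ON o.id = e.id").show()
// Expected to fail with:
//   java.lang.IllegalArgumentException: orcFileOperator: path ... does not
//   have valid orc files matching the pattern
{code}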




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14285) Improve user experience for typed aggregate functions

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219199#comment-15219199
 ] 

Apache Spark commented on SPARK-14285:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12077

> Improve user experience for typed aggregate functions
> -
>
> Key: SPARK-14285
> URL: https://issues.apache.org/jira/browse/SPARK-14285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> In the Dataset API, it is fairly difficult for users to perform simple 
> aggregations in a type-safe way at the moment because there are no 
> aggregators that have been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14285) Improve user experience for typed aggregate functions

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14285:


Assignee: Reynold Xin  (was: Apache Spark)

> Improve user experience for typed aggregate functions
> -
>
> Key: SPARK-14285
> URL: https://issues.apache.org/jira/browse/SPARK-14285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> In the Dataset API, it is fairly difficult for users to perform simple 
> aggregations in a type-safe way at the moment because there are no 
> aggregators that have been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14285) Improve user experience for typed aggregate functions

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14285:


Assignee: Apache Spark  (was: Reynold Xin)

> Improve user experience for typed aggregate functions
> -
>
> Key: SPARK-14285
> URL: https://issues.apache.org/jira/browse/SPARK-14285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> In the Dataset API, it is fairly difficult for users to perform simple 
> aggregations in a type-safe way at the moment because there are no 
> aggregators that have been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14285) Improve user experience for typed aggregate functions

2016-03-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14285:
---

 Summary: Improve user experience for typed aggregate functions
 Key: SPARK-14285
 URL: https://issues.apache.org/jira/browse/SPARK-14285
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


In the Dataset API, it is fairly difficult for users to perform simple 
aggregations in a type-safe way at the moment because there are no aggregators 
that have been implemented.
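
A minimal sketch of the kind of pre-built typed aggregator this is about, written 
against the public {{Aggregator}} API (the helpers that actually ship may look 
different; the encoder methods follow the newer Aggregator signature):

{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Type-safe sum of f(x) over a Dataset[IN], usable via groupByKey(...).agg(...).
class SumOf[IN](f: IN => Long) extends Aggregator[IN, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: IN): Long = b + f(a)
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Hypothetical usage: ds.groupByKey(_.department).agg(new SumOf[Employee](_.salary).toColumn)
{code}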



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1359) SGD implementation is not efficient

2016-03-30 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219190#comment-15219190
 ] 

Yu Ishikawa commented on SPARK-1359:


[~mbaddar] Since the current ANN in MLlib depends on `GradientDescent`, we 
should improve its efficiency.
How do we evaluate a new implementation against the current one? And what 
tasks are best for evaluating it?
- Metrics
1. Convergence efficiency
2. Compute cost
3. Compute time
4. Other
- Tasks
1. Logistic regression and linear regression with randomly generated data
2. Logistic regression and linear regression with any Kaggle data
3. Other

I made an implementation of parallelized stochastic gradient descent:
https://github.com/yu-iskw/spark-parallelized-sgd

> SGD implementation is not efficient
> ---
>
> Key: SPARK-1359
> URL: https://issues.apache.org/jira/browse/SPARK-1359
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Xiangrui Meng
>
> The SGD implementation samples a mini-batch to compute the stochastic 
> gradient. This is not efficient because examples are provided via an iterator 
> interface. We have to scan all of them to obtain a sample.
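
In sketch form (variable names are illustrative, and {{data}} / {{numIterations}} 
are assumed to be defined), the cost referred to above: each iteration draws a 
fresh mini-batch with {{RDD.sample}}, and producing that sample must walk every 
element of the iterators even when {{miniBatchFraction}} is tiny.

{code}
val miniBatchFraction = 0.01
for (i <- 1 to numIterations) {
  // sample() scans every element of data's partitions just to draw the mini-batch
  val batch = data.sample(withReplacement = false, fraction = miniBatchFraction, seed = 42L + i)
  val used = batch.count() // only this small fraction feeds the gradient step
}
{code}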



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14206) buildReader implementation for CSV

2016-03-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14206.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12002
[https://github.com/apache/spark/pull/12002]

> buildReader implementation for CSV
> --
>
> Key: SPARK-14206
> URL: https://issues.apache.org/jira/browse/SPARK-14206
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14229) PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat

2016-03-30 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219150#comment-15219150
 ] 

Russell Jurney commented on SPARK-14229:


Luke Lovett on the mongo-hadoop project has confirmed that this is a bug. At 
least, we think so; it may just be something that hasn't been implemented yet. 
It turns out that DataFrame.rdd's are not equivalent to normal RDDs in terms 
of serialization.

For more information: https://jira.mongodb.org/browse/HADOOP-276

I think this is a pretty big bug, and should be fixed. However, there is a 
workaround that Luke found which may point to a solution: 

{code}
# Take the underlying RDD and turn each Row into a dict, so the records can be
# unpickled and stored in MongoDB.
on_time_rdd = on_time_dataframe.rdd.map(lambda row: row.asDict())
on_time_rdd.saveToMongoDB('mongodb://localhost:27017/sparktest.dataframes')
{code}

Inane commentary: I'm thinking this shows how few people are actually using 
PySpark for real-world workloads, if I am the one to discover this. Saving to a 
database is a pretty common use case, and DataFrames must make up a lot of the 
workload.

> PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat
> --
>
> Key: SPARK-14229
> URL: https://issues.apache.org/jira/browse/SPARK-14229
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, Spark Shell
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>
> I am able to save data to MongoDB from any RDD... provided that RDD does not 
> belong to a DataFrame. If I use DataFrame.rdd, it is not possible to save via 
> saveAsNewAPIHadoopFile whatsoever. I have tested that this applies to saving 
> to MongoDB, BSON Files, and ElasticSearch.
> I get the following error when I try to save to a HadoopFile:
> config = {"mongo.output.uri": 
> "mongodb://localhost:27017/agile_data_science.on_time_performance"}
> In [3]: on_time_dataframe.rdd.saveAsNewAPIHadoopFile(
>...:   path='file://unused', 
>...:   outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
>...:   keyClass='org.apache.hadoop.io.Text', 
>...:   valueClass='org.apache.hadoop.io.MapWritable', 
>...:   conf=config
>...: )
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 62.7 KB, free 147.3 KB)
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1_piece0 stored 
> as bytes in memory (estimated size 20.4 KB, free 167.7 KB)
> 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
> memory on localhost:61301 (size: 20.4 KB, free: 511.1 MB)
> 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 1 from 
> javaToPython at NativeMethodAccessorImpl.java:-2
> 16/03/28 19:59:57 INFO Configuration.deprecation: mapred.min.split.size is 
> deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
> 16/03/28 19:59:57 INFO parquet.ParquetRelation: Reading Parquet file(s) from 
> file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet
> 16/03/28 19:59:57 INFO spark.SparkContext: Starting job: take at 
> SerDeUtil.scala:231
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Got job 1 (take at 
> SerDeUtil.scala:231) with 1 output partitions
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 
> (take at SerDeUtil.scala:231)
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting ResultStage 1 
> (MapPartitionsRDD[6] at mapPartitions at SerDeUtil.scala:146), which has no 
> missing parents
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 14.9 KB, free 182.6 KB)
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 7.5 KB, free 190.1 KB)
> 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on localhost:61301 (size: 7.5 KB, free: 511.1 MB)
> 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1006
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 1 (MapPartitionsRDD[6] at mapPartitions at 
> SerDeUtil.scala:146)
> 16/03/28 19:59:57 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 
> 1 tasks
> 16/03/28 19:59:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 1.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2739 bytes)
> 16/03/28 19:59:57 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 
> 8)
> 16/03/28 19:59:58 INFO 

[jira] [Commented] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-30 Thread Denton Cockburn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219137#comment-15219137
 ] 

Denton Cockburn commented on SPARK-13801:
-

I'm unsure if this is the same issue, but I hit upon this problem.

{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 1,1,1), (2,2,2,2)).toDF("a", "b", "c", "d")
val f = df.where($"a" === 1).alias("a")
val s = df.where($"a" === 2).alias("b")

f.join(s, f("b") === s("b") and f("c") === s("c"), 
"outer").select(coalesce(f("b"), s("b")), coalesce(f("c"), s("c")), 
coalesce(f("d"), s("d"))).show
{code}

The output is:
{code}
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
|            1|            1|            1|
|         null|         null|         null|
+-------------+-------------+-------------+
{code}

Instead of:
{code}
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
|            1|            1|            1|
|            2|            2|            2|
+-------------+-------------+-------------+
{code}

> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I saw some JIRAs complaining about wrong results when using the 
> DataFrame API. After checking their queries, I found the cause was an indirect 
> self-join with incorrectly built join conditions. For example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, while df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, as a logical plan's output 
> may change. For example, we generate new output for the right child in a 
> self-join.
> My proposal: `DataFrame.col` should always return an unresolved attribute. We 
> can still do the resolution to make sure the given column name is resolvable, 
> but instead of returning the resolved attribute, just take the name and wrap it 
> in UnresolvedAttribute.
> Then, if users run the example query I gave at the beginning, they will get an 
> analysis exception, and they will understand that they need to alias df and df2 
> before the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14087:
--
Target Version/s: 2.0.0

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit

2016-03-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219103#comment-15219103
 ] 

Joseph K. Bradley commented on SPARK-14087:
---

Linking with [SPARK-10931] since these two patches may conflict

> PySpark ML JavaModel does not properly own params after being fit
> -
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the object to the class in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model and can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
> [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", 
> "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>   print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation of this, the model overrides the UID value, which 
> the Params use when they are constructed, so they all end up with the parent 
> estimator UID.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14081) DataFrameNaFunctions fill should not convert float fields to double

2016-03-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14081.
-
   Resolution: Fixed
 Assignee: Travis Crawford
Fix Version/s: 2.0.0

> DataFrameNaFunctions fill should not convert float fields to double
> ---
>
> Key: SPARK-14081
> URL: https://issues.apache.org/jira/browse/SPARK-14081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Travis Crawford
>Assignee: Travis Crawford
> Fix For: 2.0.0
>
>
> [DataFrameNaFunctions|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala]
>  provides useful function for dealing with null values in a DataFrame. 
> Currently it changes FloatType columns to DoubleType when zero filling. Spark 
> should preserve the column data type.
> In the following example, notice how `zeroFilledDF` has its `floatField` 
> converted from float to double.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema = StructType(Seq(
>   StructField("intField", IntegerType),
>   StructField("longField", LongType),
>   StructField("floatField", FloatType),
>   StructField("doubleField", DoubleType)))
> val rdd = sc.parallelize(Seq(Row(1,1L,1f,1d), Row(null,null,null,null)))
> val df = sqlContext.createDataFrame(rdd, schema)
> val zeroFilledDF = df.na.fill(0)
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(intField,IntegerType,true), 
> StructField(longField,LongType,true), StructField(floatField,FloatType,true), 
> StructField(doubleField,DoubleType,true))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[2] at parallelize at :48
> df: org.apache.spark.sql.DataFrame = [intField: int, longField: bigint, 
> floatField: float, doubleField: double]
> zeroFilledDF: org.apache.spark.sql.DataFrame = [intField: int, longField: 
> bigint, floatField: double, doubleField: double]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13538) Add GaussianMixture to ML

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13538:
--
Shepherd: Joseph K. Bradley
Assignee: zhengruifeng
Target Version/s: 2.0.0

> Add GaussianMixture to ML
> -
>
> Key: SPARK-13538
> URL: https://issues.apache.org/jira/browse/SPARK-13538
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> Add GaussianMixture and GaussianMixtureModel to ML package



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14284) Rename KMeansSummary.size to clusterSizes

2016-03-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14284:
-

 Summary: Rename KMeansSummary.size to clusterSizes
 Key: SPARK-14284
 URL: https://issues.apache.org/jira/browse/SPARK-14284
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial


In spark.ml KMeansSummary:
We should deprecate the existing method {{size}} and create an alias called 
{{clusterSizes}}
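
A minimal sketch of the deprecate-and-alias pattern being proposed (the class 
below is a simplified stand-in, not the actual spark.ml KMeansSummary source, and 
the element type and deprecation version string are assumptions):

{code}
class KMeansSummaryLike(private val sizes: Array[Long]) {
  /** New, more descriptive name: number of points assigned to each cluster. */
  def clusterSizes: Array[Long] = sizes

  /** Old name kept as a deprecated alias during the transition. */
  @deprecated("Use clusterSizes instead", "2.0.0")
  def size: Array[Long] = clusterSizes
}
{code}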



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14282.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */   boolean isNull = false;
> /* 023 */   final Object[] values = new Object[2];
> /* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */ final int value9 = -1;
> /* 056 */ isNull6 = true;
> /* 057 */ value6 = value9;
> /* 058 */   } else {
> ...
> /* 077 */   return mutableRow;
> /* 078 */ }
> /* 079 */ }
> /* 080 */ 
> {code}
> *After*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */ boolean isNull = false;
> /* 023 */ final Object[] values = new Object[2];
> /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */   final int value9 = -1;
> /* 056 */   isNull6 = true;
> /* 057 */   value6 = value9;
> /* 058 */ } else {
> ...
> /* 077 */ return mutableRow;
> /* 078 */   }
> /* 079 */ }
> /* 080 */ 
> {code}
> Also, this issue fixes the following too. (Similar with SPARK-14185)
> *Before*
> {code}
> 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
> generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}
> *After*
> {code}
> 16/03/30 12:46:32 DEBUG WholeStageCodegen: 
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13286) JDBC driver doesn't report full exception

2016-03-30 Thread Paul Zaczkieiwcz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219065#comment-15219065
 ] 

Paul Zaczkieiwcz commented on SPARK-13286:
--

I looked through 
{{sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala}}
 and found out I could set the batch size to 1 in the {{java.util.Properties}} 
object ({{p.setProperty("batchsize", "1")}}). That fixed my problem, but of 
course completely removed the performance improvement gained from batching.

As for logging, {{JdbcUtils}} is the only place in the spark code that calls 
{{executeBatch}}. It does so twice.  Those two lines are probably the only 
lines where catching {{java.sql.BatchUpdateException}} is necessary.
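
For reference, a hedged sketch of the workaround described above (the URL, table 
name and credentials are placeholders): setting {{batchsize}} to 1 in the 
connection properties makes each row its own batch, so the real SQLException 
surfaces instead of an opaque {{BatchUpdateException}}, at the cost of losing the 
batching speed-up.

{code}
import java.util.Properties
import org.apache.spark.sql.DataFrame

def debugJdbcWrite(df: DataFrame): Unit = {
  val props = new Properties()
  props.setProperty("user", "spark")
  props.setProperty("password", "secret")
  props.setProperty("batchsize", "1") // one row per JDBC batch: slow, but errors point at the exact row
  df.write.jdbc("jdbc:postgresql://localhost:5432/mydb", "core", props)
}
{code}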

> JDBC driver doesn't report full exception
> -
>
> Key: SPARK-13286
> URL: https://issues.apache.org/jira/browse/SPARK-13286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>Priority: Minor
>
> While testing some failure scenarios (inserting data into PostgreSQL where 
> there is a schema mismatch), an exception is thrown (fine so far); however, it 
> doesn't report the actual SQL error. It refers to a getNextException call, but 
> handling that correctly is beyond my non-existent Java skills. Supporting this 
> would help users see the SQL error quickly and resolve the underlying problem.
> {noformat}
> Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core 
> VALUES('5fdf5...',) was aborted.  Call getNextException to see the cause.
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14251) Add SQL command for printing out generated code for debugging

2016-03-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219054#comment-15219054
 ] 

Dongjoon Hyun commented on SPARK-14251:
---

Thanks! :)

> Add SQL command for printing out generated code for debugging
> -
>
> Key: SPARK-14251
> URL: https://issues.apache.org/jira/browse/SPARK-14251
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> SPARK-14227 adds a programatic way to dump generated code. In pure SQL 
> environment this doesn't work. It would be great if we can have 
> {noformat}
> explain codegen select * ...
> {noformat}
> return the generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11507.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.2
   1.5.3

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.1, 2.0.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.5.3, 1.6.2, 2.0.0
>
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11507:
--
Target Version/s: 1.5.3, 1.6.2, 2.0.0  (was: 1.4.2, 1.5.3, 1.6.2, 2.0.0)

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.1, 2.0.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Assignee: yuhao yang
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14259) Add config to control maximum number of files when coalescing partitions

2016-03-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14259:
-
Assignee: Takeshi Yamamuro

> Add config to control maximum number of files when coalescing partitions
> 
>
> Key: SPARK-14259
> URL: https://issues.apache.org/jira/browse/SPARK-14259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> The FileSourceStrategy currently has a config to control the maximum byte 
> size of coalesced partitions. It is helpful to also have a config to control 
> the maximum number of files as even small files have a non-trivial fixed 
> cost. The current packing can put a lot of small files together, which causes 
> straggler tasks.
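
As a rough sketch of how such a limit would be used (the config key below is purely hypothetical; the actual name is whatever the pull request introduces):

{code}
// Hypothetical key, shown only to illustrate capping the number of files packed
// into one coalesced partition, alongside the existing byte-size limit.
// Assumes an existing SQLContext named sqlContext (e.g. from spark-shell).
sqlContext.setConf("spark.sql.files.maxNumFilesPerPartition", "100")
val df = sqlContext.read.parquet("/data/events")   // illustrative input path
{code}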



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14259) Add config to control maximum number of files when coalescing partitions

2016-03-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14259.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12068
[https://github.com/apache/spark/pull/12068]

> Add config to control maximum number of files when coalescing partitions
> 
>
> Key: SPARK-14259
> URL: https://issues.apache.org/jira/browse/SPARK-14259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> The FileSourceStrategy currently has a config to control the maximum byte 
> size of coalesced partitions. It is helpful to also have a config to control 
> the maximum number of files as even small files have a non-trivial fixed 
> cost. The current packing can put a lot of small files together, which causes 
> straggler tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14251) Add SQL command for printing out generated code for debugging

2016-03-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219024#comment-15219024
 ] 

Reynold Xin commented on SPARK-14251:
-

Yes please go for it.


> Add SQL command for printing out generated code for debugging
> -
>
> Key: SPARK-14251
> URL: https://issues.apache.org/jira/browse/SPARK-14251
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL 
> environment this doesn't work. It would be great if we could have 
> {noformat}
> explain codegen select * ...
> {noformat}
> return the generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14251) Add SQL command for printing out generated code for debugging

2016-03-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219017#comment-15219017
 ] 

Dongjoon Hyun commented on SPARK-14251:
---

Hi, [~rxin]. 
May I work on this issue? 

> Add SQL command for printing out generated code for debugging
> -
>
> Key: SPARK-14251
> URL: https://issues.apache.org/jira/browse/SPARK-14251
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL 
> environment this doesn't work. It would be great if we could have 
> {noformat}
> explain codegen select * ...
> {noformat}
> return the generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10931:
--
Shepherd: Joseph K. Bradley
Target Version/s: 2.0.0

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14152) MultilayerPerceptronClassifier supports save/load for Python API

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14152:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11939

> MultilayerPerceptronClassifier supports save/load for Python API
> 
>
> Key: SPARK-14152
> URL: https://issues.apache.org/jira/browse/SPARK-14152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> MultilayerPerceptronClassifier supports save/load for Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14152) MultilayerPerceptronClassifier supports save/load for Python API

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14152.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11952
[https://github.com/apache/spark/pull/11952]

> MultilayerPerceptronClassifier supports save/load for Python API
> 
>
> Key: SPARK-14152
> URL: https://issues.apache.org/jira/browse/SPARK-14152
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> MultilayerPerceptronClassifier supports save/load for Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13785) Deprecate model field in ML model summary classes

2016-03-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219004#comment-15219004
 ] 

Joseph K. Bradley commented on SPARK-13785:
---

Sure, go ahead please, but I'd prefer to deprecate it since there is no need to 
hurry in removing it.

We need to expose the column names for users to make use of the {{predictions}} 
DataFrame.  Otherwise, they won't be able to tell (programmatically) which 
columns are which.
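
As a rough illustration of why that matters, here is a sketch assuming the summary exposes accessors such as predictions, predictionCol and labelCol (names as proposed, not a final API):

{code}
// With the column names exposed on the summary, users never hard-code columns:
val summary = model.summary   // model: a fitted regression model (assumed to exist)
val residuals = summary.predictions
  .selectExpr(s"${summary.predictionCol} - ${summary.labelCol} AS residual")
{code}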

> Deprecate model field in ML model summary classes
> -
>
> Key: SPARK-13785
> URL: https://issues.apache.org/jira/browse/SPARK-13785
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> ML model summary classes (e.g., LinearRegressionSummary) currently expose a 
> field "model" containing the parent model.  It's weird to have this circular 
> reference, and I don't see a good reason why the summary should expose it 
> (unless I'm forgetting some decision we made before...).
> I'd propose to deprecate that field in 2.0 and to remove it in 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()

2016-03-30 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218991#comment-15218991
 ] 

holdenk commented on SPARK-14141:
-

If the data fits in memory on the cluster, cache + count + toLocalIterator 
would probably be good (given the current setup).
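
A minimal sketch of that pattern on the Scala side (df is an assumed DataFrame; the driver-side pandas conversion would then consume the iterator batch by batch):

{code}
// Materialize the data on the executors, then stream it to the driver one
// partition at a time instead of collect()-ing everything at once.
df.cache()
df.count()                          // forces the cache to be populated
val rows = df.rdd.toLocalIterator   // pulls one partition at a time
rows.take(5).foreach(println)
{code}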

> Let user specify datatypes of pandas dataframe in toPandas()
> 
>
> Key: SPARK-14141
> URL: https://issues.apache.org/jira/browse/SPARK-14141
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output, PySpark, SQL
>Reporter: Luke Miner
>Priority: Minor
>
> Would be nice to specify the dtypes of the pandas dataframe during the 
> toPandas() call. Something like:
> bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', 
> 'd': 'category'})
> Since dtypes like `category` are more memory efficient, you could potentially 
> load many more rows into a pandas dataframe with this option without running 
> out of memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2016-03-30 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218989#comment-15218989
 ] 

Josh Rosen commented on SPARK-928:
--

It looks like we'll _finally_ be able to do this after SPARK-11416 goes in.

> Add support for Unsafe-based serializer in Kryo 2.22
> 
>
> Key: SPARK-928
> URL: https://issues.apache.org/jira/browse/SPARK-928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This can reportedly be quite a bit faster, but it also requires Chill to 
> update its Kryo dependency. Once that happens we should add a 
> spark.kryo.useUnsafe flag.
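
If the flag lands, enabling it would presumably look something like the sketch below (spark.kryo.useUnsafe is the proposed name, not an existing setting):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.useUnsafe", "true")   // proposed flag, not available yet
{code}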



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11416) Upgrade kryo package to version 3.0

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11416:


Assignee: Josh Rosen  (was: Apache Spark)

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Assignee: Josh Rosen
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218982#comment-15218982
 ] 

Apache Spark commented on SPARK-11416:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12076

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Assignee: Josh Rosen
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11416) Upgrade kryo package to version 3.0

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11416:


Assignee: Apache Spark  (was: Josh Rosen)

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Assignee: Apache Spark
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11507:
--
Shepherd: Joseph K. Bradley
Assignee: yuhao yang
Target Version/s: 1.4.2, 1.5.3, 1.6.2, 2.0.0

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.1, 2.0.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Assignee: yuhao yang
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11507:
--
Affects Version/s: 2.0.0
   1.4.1
   1.6.1

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.1, 2.0.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13064) api/v1/application/jobs/attempt lacks "attempId" field for spark-shell

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13064:


Assignee: Apache Spark

> api/v1/application/jobs/attempt lacks "attempId" field for spark-shell
> --
>
> Key: SPARK-13064
> URL: https://issues.apache.org/jira/browse/SPARK-13064
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Zhuo Liu
>Assignee: Apache Spark
>Priority: Minor
>
> Any application launched with spark-shell will not have an attemptId field 
> in its REST API. From the REST API point of view, we might want to force an 
> id for it, i.e., "1".
> {code}
> {
>   "id" : "application_1453789230389_377545",
>   "name" : "PySparkShell",
>   "attempts" : [ {
> "startTime" : "2016-01-28T02:17:11.035GMT",
> "endTime" : "2016-01-28T02:30:01.355GMT",
> "lastUpdated" : "2016-01-28T02:30:01.516GMT",
> "duration" : 770320,
> "sparkUser" : "huyng",
> "completed" : true
>   } ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13064) api/v1/application/jobs/attempt lacks "attempId" field for spark-shell

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13064:


Assignee: (was: Apache Spark)

> api/v1/application/jobs/attempt lacks "attempId" field for spark-shell
> --
>
> Key: SPARK-13064
> URL: https://issues.apache.org/jira/browse/SPARK-13064
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Zhuo Liu
>Priority: Minor
>
> Any application launched with spark-shell will not have an attemptId field 
> in its REST API. From the REST API point of view, we might want to force an 
> id for it, i.e., "1".
> {code}
> {
>   "id" : "application_1453789230389_377545",
>   "name" : "PySparkShell",
>   "attempts" : [ {
> "startTime" : "2016-01-28T02:17:11.035GMT",
> "endTime" : "2016-01-28T02:30:01.355GMT",
> "lastUpdated" : "2016-01-28T02:30:01.516GMT",
> "duration" : 770320,
> "sparkUser" : "huyng",
> "completed" : true
>   } ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13064) api/v1/application/jobs/attempt lacks "attempId" field for spark-shell

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218970#comment-15218970
 ] 

Apache Spark commented on SPARK-13064:
--

User 'zhuoliu' has created a pull request for this issue:
https://github.com/apache/spark/pull/12075

> api/v1/application/jobs/attempt lacks "attempId" field for spark-shell
> --
>
> Key: SPARK-13064
> URL: https://issues.apache.org/jira/browse/SPARK-13064
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Zhuo Liu
>Priority: Minor
>
> Any application launched with spark-shell will not have an attemptId field 
> in its REST API. From the REST API point of view, we might want to force an 
> id for it, i.e., "1".
> {code}
> {
>   "id" : "application_1453789230389_377545",
>   "name" : "PySparkShell",
>   "attempts" : [ {
> "startTime" : "2016-01-28T02:17:11.035GMT",
> "endTime" : "2016-01-28T02:30:01.355GMT",
> "lastUpdated" : "2016-01-28T02:30:01.516GMT",
> "duration" : 770320,
> "sparkUser" : "huyng",
> "completed" : true
>   } ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14152) MultilayerPerceptronClassifier supports save/load for Python API

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14152:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang
Target Version/s: 2.0.0

> MultilayerPerceptronClassifier supports save/load for Python API
> 
>
> Key: SPARK-14152
> URL: https://issues.apache.org/jira/browse/SPARK-14152
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> MultilayerPerceptronClassifier supports save/load for Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7425:
-
Target Version/s: 2.0.0

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Benjamin Fradet
>Priority: Minor
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
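
A minimal sketch of the kind of relaxation being described, assuming the schema check accepts any NumericType and casts the label to double (the helper name is illustrative):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, NumericType}

// Illustrative helper: accept any numeric label column instead of requiring DoubleType.
def castLabelToDouble(df: DataFrame, labelCol: String): DataFrame =
  df.schema(labelCol).dataType match {
    case DoubleType     => df
    case _: NumericType => df.withColumn(labelCol, col(labelCol).cast(DoubleType))
    case other          => throw new IllegalArgumentException(
      s"Label column '$labelCol' must be numeric, found $other")
  }
{code}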



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7425:
-
Labels:   (was: starter)

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Benjamin Fradet
>Priority: Minor
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14277) UnsafeSorterSpillReader should do buffered read from underlying compression stream

2016-03-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14277:
---
Assignee: Sital Kedia

> UnsafeSorterSpillReader should do buffered read from underlying compression 
> stream
> --
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>
> While running a Spark job that is spilling a lot of data in the reduce phase, we 
> see that a significant amount of CPU is being consumed in the native Snappy 
> ArrayCopy method (please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason is that the SpillReader does a lot of small reads from the 
> underlying Snappy-compressed stream, and we pay a heavy cost in JNI calls for 
> these small reads. The SpillReader should instead do a buffered read from the 
> underlying Snappy-compressed stream.
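
A rough sketch of the proposed fix: layer a BufferedInputStream over the Snappy stream so the many small readLong()/readFully() calls hit the JNI-backed stream in large chunks (file name and buffer size are illustrative):

{code}
import java.io.{BufferedInputStream, DataInputStream, FileInputStream}
import org.xerial.snappy.SnappyInputStream

// One large decompressed read per megabyte instead of one tiny read per record.
val in = new DataInputStream(
  new BufferedInputStream(
    new SnappyInputStream(new FileInputStream("/tmp/spill-0.data")), 1 << 20))
val numRecords = in.readInt()   // illustrative; the real layout is defined by the spill writer
{code}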



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14277) UnsafeSorterSpillReader should do buffered read from underlying compression stream

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14277:


Assignee: Apache Spark

> UnsafeSorterSpillReader should do buffered read from underlying compression 
> stream
> --
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> While running a Spark job that is spilling a lot of data in the reduce phase, we 
> see that a significant amount of CPU is being consumed in the native Snappy 
> ArrayCopy method (please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason is that the SpillReader does a lot of small reads from the 
> underlying Snappy-compressed stream, and we pay a heavy cost in JNI calls for 
> these small reads. The SpillReader should instead do a buffered read from the 
> underlying Snappy-compressed stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14277) UnsafeSorterSpillReader should do buffered read from underlying compression stream

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218939#comment-15218939
 ] 

Apache Spark commented on SPARK-14277:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/12074

> UnsafeSorterSpillReader should do buffered read from underlying compression 
> stream
> --
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a Spark job that is spilling a lot of data in the reduce phase, we 
> see that a significant amount of CPU is being consumed in the native Snappy 
> ArrayCopy method (please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason is that the SpillReader does a lot of small reads from the 
> underlying Snappy-compressed stream, and we pay a heavy cost in JNI calls for 
> these small reads. The SpillReader should instead do a buffered read from the 
> underlying Snappy-compressed stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14230) Config the start time (jitter) for streaming jobs

2016-03-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218938#comment-15218938
 ] 

Davies Liu commented on SPARK-14230:


For non-windowed batches, this could be supported via a trigger; see 
https://github.com/apache/spark/pull/11976/files

> Config the start time (jitter) for streaming jobs
> -
>
> Key: SPARK-14230
> URL: https://issues.apache.org/jira/browse/SPARK-14230
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Liyin Tang
>
> Currently, RecurringTimer normalizes the start time. For instance, if the 
> batch duration is 1 min, all jobs will start exactly on the 1 min boundary. 
> This adds some burden to the streaming source. Assuming the source 
> is Kafka, and there is a list of streaming jobs with a 1 min batch duration, 
> then during the first few seconds of each minute, high network traffic will be 
> observed in Kafka. This makes Kafka capacity planning tricky. 
> It would be great to have an option in the streaming context to set the job 
> start time. That way, users can add a jitter to the start time of each job 
> and make Kafka fetch requests much smoother across the duration window.
> {code}
> class RecurringTimer {
>   def getStartTime(): Long = {
> (math.floor(clock.currentTime.toDouble / period) + 1).toLong * period + 
> jitter
>   }
> }
> {code}
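
For illustration, plugging concrete numbers into the snippet above (values are made up):

{code}
// With a 1 min period and a 7 s jitter, the timer fires 7 s after the next
// minute boundary, so jobs configured with different jitters spread their load.
val period = 60000L
val jitter = 7000L
val now    = 1459310412345L   // illustrative clock.currentTime
val start  = (math.floor(now.toDouble / period) + 1).toLong * period + jitter
// start == 1459310460000L + 7000L, i.e. the next minute boundary plus the jitter
{code}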



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-30 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218937#comment-15218937
 ] 

JESSE CHEN commented on SPARK-13820:


We are able to run 93 now. We should shoot for all 99. And this JIRA will fix 2 
more :)

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14277) UnsafeSorterSpillReader should do buffered read from underlying compression stream

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14277:


Assignee: (was: Apache Spark)

> UnsafeSorterSpillReader should do buffered read from underlying compression 
> stream
> --
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a Spark job that is spilling a lot of data in the reduce phase, we 
> see that a significant amount of CPU is being consumed in the native Snappy 
> ArrayCopy method (please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason is that the SpillReader does a lot of small reads from the 
> underlying Snappy-compressed stream, and we pay a heavy cost in JNI calls for 
> these small reads. The SpillReader should instead do a buffered read from the 
> underlying Snappy-compressed stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()

2016-03-30 Thread Luke Miner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218930#comment-15218930
 ] 

Luke Miner commented on SPARK-14141:


Anecdotally, at least, it seems like a pretty common workflow for data 
scientists is to use Spark to preprocess data down to a size where it can be 
sent through scikit-learn/theano/nltk/tensorflow/etc. So I'd imagine that 
there'd be a fair amount of uptake on a feature that would make it painless to 
get large datasets from Spark into pandas on a single machine. I don't know if 
this means that it belongs in Spark, but I'd find it very useful even if it is 
a little slow!

Incidentally, I added an issue to Pandas to allow dtypes to be specified in the 
`from_records()` constructor. I figured this would be useful regardless, if 
only to make it easier to preserve some more type information during the 
conversion (e.g. datetime columns): 
https://github.com/pydata/pandas/issues/12751

> Let user specify datatypes of pandas dataframe in toPandas()
> 
>
> Key: SPARK-14141
> URL: https://issues.apache.org/jira/browse/SPARK-14141
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output, PySpark, SQL
>Reporter: Luke Miner
>Priority: Minor
>
> Would be nice to specify the dtypes of the pandas dataframe during the 
> toPandas() call. Something like:
> bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', 
> 'd': 'category'})
> Since dtypes like `category` are more memory efficient, you could potentially 
> load many more rows into a pandas dataframe with this option without running 
> out of memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11416) Upgrade kryo package to version 3.0

2016-03-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-11416:
--

Assignee: Josh Rosen

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Assignee: Josh Rosen
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14281) Fix the java8-tests profile and run those tests in Jenkins

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218914#comment-15218914
 ] 

Apache Spark commented on SPARK-14281:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12073

> Fix the java8-tests profile and run those tests in Jenkins
> --
>
> Key: SPARK-14281
> URL: https://issues.apache.org/jira/browse/SPARK-14281
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark has some tests for compilation of Java 8 sources (using lambdas) 
> guarded behind a {{java8-tests}} maven profile, but we currently do not build 
> or run those tests. As a result, the tests no longer compile.
> We should fix these tests and set up automated CI so that they don't break 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()

2016-03-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218909#comment-15218909
 ] 

Davies Liu commented on SPARK-14141:


toLocalIterator is better than collect, but it will run the partitions 
sequentially (which will be slow).

> Let user specify datatypes of pandas dataframe in toPandas()
> 
>
> Key: SPARK-14141
> URL: https://issues.apache.org/jira/browse/SPARK-14141
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output, PySpark, SQL
>Reporter: Luke Miner
>Priority: Minor
>
> Would be nice to specify the dtypes of the pandas dataframe during the 
> toPandas() call. Something like:
> bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', 
> 'd': 'category'})
> Since dtypes like `category` are more memory efficient, you could potentially 
> load many more rows into a pandas dataframe with this option without running 
> out of memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218906#comment-15218906
 ] 

Davies Liu commented on SPARK-13820:


[~jfc...@us.ibm.com] How much modification have you done? About 78 queries 
can run on my side; see 
https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS_1_4_Queries.scala

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13286) JDBC driver doesn't report full exception

2016-03-30 Thread Paul Zaczkieiwcz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218900#comment-15218900
 ] 

Paul Zaczkieiwcz commented on SPARK-13286:
--

I'm seeing this in my production code that used to work in Spark 1.5.1. I 
understand that Spark 1.6.1 switched from doing individual INSERTs in its JDBC 
output to batching 1000 INSERTs at a time. I can't even reproduce the error by 
running the same INSERT statement directly in psql (I'm outputting to 
Postgres). I'm at a loss what could be causing this, particularly with 
SaveMode.Overwrite.  There isn't any possibility of a schema mismatch unless 
there is a race condition causing the workers to try to insert into the table 
before it exists.
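
For anyone else stuck on this, a minimal sketch of chasing the chained exception by hand until Spark reports it (plain JDBC, nothing Spark-specific):

{code}
import java.sql.SQLException

// BatchUpdateException is a SQLException; walking getNextException reveals the
// real Postgres error hidden behind "Call getNextException to see the cause".
def logSqlExceptionChain(e: SQLException): Unit = {
  var cur: SQLException = e
  while (cur != null) {
    println(s"SQLState=${cur.getSQLState}: ${cur.getMessage}")
    cur = cur.getNextException
  }
}
{code}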

> JDBC driver doesn't report full exception
> -
>
> Key: SPARK-13286
> URL: https://issues.apache.org/jira/browse/SPARK-13286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>Priority: Minor
>
> Testing some failure scenarios (inserting data into PostgreSQL where there is 
> a schema mismatch), an exception is thrown (fine so far); however, it 
> doesn't report the actual SQL error.  It refers to a getNextException call, 
> but this is beyond my non-existent Java skills to deal with correctly.  
> Supporting this would help users see the SQL error quickly and resolve the 
> underlying problem.
> {noformat}
> Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core 
> VALUES('5fdf5...',) was aborted.  Call getNextException to see the cause.
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14282:
--
Description: 
This issue improves `CodeFormatter` to fix the following cases.

*Before*
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

*After*
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this issue fixes the following too. (Similar with SPARK-14185)
*Before*
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

*After*
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}



  was:
This issue improves `CodeFormatter` to fix the following cases.

*Before*
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

*After*
{code}
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this issue fixes the following too. (Similar with SPARK-14185)
*Before*
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

*After*
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}




> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, 

[jira] [Updated] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14282:
--
Issue Type: Bug  (was: Improvement)

> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */   boolean isNull = false;
> /* 023 */   final Object[] values = new Object[2];
> /* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */ final int value9 = -1;
> /* 056 */ isNull6 = true;
> /* 057 */ value6 = value9;
> /* 058 */   } else {
> ...
> /* 077 */   return mutableRow;
> /* 078 */ }
> /* 079 */ }
> /* 080 */ 
> {code}
> *After*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */ boolean isNull = false;
> /* 023 */ final Object[] values = new Object[2];
> /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */   final int value9 = -1;
> /* 056 */   isNull6 = true;
> /* 057 */   value6 = value9;
> /* 058 */ } else {
> ...
> /* 077 */ return mutableRow;
> /* 078 */   }
> /* 079 */ }
> /* 080 */ 
> {code}
> Also, this issue fixes the following too. (Similar with SPARK-14185)
> *Before*
> {code}
> 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
> generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}
> *After*
> {code}
> 16/03/30 12:46:32 DEBUG WholeStageCodegen: 
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14283) Avoid sort in randomSplit when possible

2016-03-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14283:
-

 Summary: Avoid sort in randomSplit when possible
 Key: SPARK-14283
 URL: https://issues.apache.org/jira/browse/SPARK-14283
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Joseph K. Bradley


Dataset.randomSplit sorts each partition in order to guarantee an ordering and 
make randomSplit deterministic given the seed.  Since randomSplit is used a 
fair amount in ML, it would be great to avoid the sort when possible.

Are there cases when it could be avoided?
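
For context, a hedged sketch of the ML call pattern the description has in mind (the function name and 80/20 weights are illustrative assumptions); every call like this currently pays the per-partition sort so that the same seed yields the same split:
{code}
import org.apache.spark.sql.{Dataset, Row}

// Hedged sketch only; `dataset` is assumed to be any Dataset[Row] built elsewhere.
// Today each call sorts every partition so the split is reproducible per seed.
def trainTestSplit(dataset: Dataset[Row]): (Dataset[Row], Dataset[Row]) = {
  val Array(train, test) = dataset.randomSplit(Array(0.8, 0.2), seed = 42L)
  (train, test)
}
{code}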



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14282:


Assignee: Apache Spark

> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */   boolean isNull = false;
> /* 023 */   final Object[] values = new Object[2];
> /* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */ final int value9 = -1;
> /* 056 */ isNull6 = true;
> /* 057 */ value6 = value9;
> /* 058 */   } else {
> ...
> /* 077 */   return mutableRow;
> /* 078 */ }
> /* 079 */ }
> /* 080 */ 
> {code}
> *After*
> {code}
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */   final int value9 = -1;
> /* 056 */   isNull6 = true;
> /* 057 */   value6 = value9;
> /* 058 */ } else {
> ...
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */ boolean isNull = false;
> /* 023 */ final Object[] values = new Object[2];
> /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 077 */ return mutableRow;
> /* 078 */   }
> /* 079 */ }
> /* 080 */ 
> {code}
> Also, this issue fixes the following too. (Similar with SPARK-14185)
> *Before*
> {code}
> 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
> generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}
> *After*
> {code}
> 16/03/30 12:46:32 DEBUG WholeStageCodegen: 
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218876#comment-15218876
 ] 

Apache Spark commented on SPARK-14282:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12072

> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */   boolean isNull = false;
> /* 023 */   final Object[] values = new Object[2];
> /* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */ final int value9 = -1;
> /* 056 */ isNull6 = true;
> /* 057 */ value6 = value9;
> /* 058 */   } else {
> ...
> /* 077 */   return mutableRow;
> /* 078 */ }
> /* 079 */ }
> /* 080 */ 
> {code}
> *After*
> {code}
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */   final int value9 = -1;
> /* 056 */   isNull6 = true;
> /* 057 */   value6 = value9;
> /* 058 */ } else {
> ...
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */ boolean isNull = false;
> /* 023 */ final Object[] values = new Object[2];
> /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 077 */ return mutableRow;
> /* 078 */   }
> /* 079 */ }
> /* 080 */ 
> {code}
> Also, this issue fixes the following too. (Similar with SPARK-14185)
> *Before*
> {code}
> 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
> generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}
> *After*
> {code}
> 16/03/30 12:46:32 DEBUG WholeStageCodegen: 
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14282:


Assignee: (was: Apache Spark)

> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */   boolean isNull = false;
> /* 023 */   final Object[] values = new Object[2];
> /* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */ final int value9 = -1;
> /* 056 */ isNull6 = true;
> /* 057 */ value6 = value9;
> /* 058 */   } else {
> ...
> /* 077 */   return mutableRow;
> /* 078 */ }
> /* 079 */ }
> /* 080 */ 
> {code}
> *After*
> {code}
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */   final int value9 = -1;
> /* 056 */   isNull6 = true;
> /* 057 */   value6 = value9;
> /* 058 */ } else {
> ...
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */ boolean isNull = false;
> /* 023 */ final Object[] values = new Object[2];
> /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 077 */ return mutableRow;
> /* 078 */   }
> /* 079 */ }
> /* 080 */ 
> {code}
> Also, this issue fixes the following too. (Similar with SPARK-14185)
> *Before*
> {code}
> 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
> generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}
> *After*
> {code}
> 16/03/30 12:46:32 DEBUG WholeStageCodegen: 
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14282) CodeFormatter should handle oneline comment with /* */ properly

2016-03-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14282:
--
Summary: CodeFormatter should handle oneline comment with /* */ properly  
(was: CodeFormatter should handle oneline comment with /* */)

> CodeFormatter should handle oneline comment with /* */ properly
> ---
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */   boolean isNull = false;
> /* 023 */   final Object[] values = new Object[2];
> /* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */ final int value9 = -1;
> /* 056 */ isNull6 = true;
> /* 057 */ value6 = value9;
> /* 058 */   } else {
> ...
> /* 077 */   return mutableRow;
> /* 078 */ }
> /* 079 */ }
> /* 080 */ 
> {code}
> *After*
> {code}
> /* 053 */ if (!false && false) {
> /* 054 */   /* null */
> /* 055 */   final int value9 = -1;
> /* 056 */   isNull6 = true;
> /* 057 */   value6 = value9;
> /* 058 */ } else {
> ...
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
> /* 022 */ boolean isNull = false;
> /* 023 */ final Object[] values = new Object[2];
> /* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
> /* 025 */ /* isnull(input[0, double]) */
> ...
> /* 077 */ return mutableRow;
> /* 078 */   }
> /* 079 */ }
> /* 080 */ 
> {code}
> Also, this issue fixes the following too. (Similar with SPARK-14185)
> *Before*
> {code}
> 16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
> generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}
> *After*
> {code}
> 16/03/30 12:46:32 DEBUG WholeStageCodegen: 
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14282) CodeFormatter should handle oneline comment with /* */

2016-03-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14282:
--
Description: 
This issue improves `CodeFormatter` to fix the following cases.

*Before*
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

*After*
{code}
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this issue fixes the following too. (Similar with SPARK-14185)
*Before*
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

*After*
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}



  was:
This issue improves `CodeFormatter` to fix the following cases.

*Before*
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

*After*
{code}
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this PR fixes the following too. (Similar with SPARK-14185)
*Before*
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

*After*
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}




> CodeFormatter should handle oneline comment with /* */
> --
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if 

[jira] [Commented] (SPARK-13850) TimSort Comparison method violates its general contract

2016-03-30 Thread Regan Dvoskin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218850#comment-15218850
 ] 

Regan Dvoskin commented on SPARK-13850:
---

We're seeing a query fail with the same stack trace on an inner join of two large 
Hive tables. The query worked on 1.5, and works on 1.6.0 if 
spark.memory.useLegacyMode is true, but fails on 1.6.0 when legacy mode is 
not enabled. 
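
To make the error message concrete for anyone hitting it, here is an illustrative (and deliberately broken) Scala comparator that violates the contract TimSort checks; this is not Spark's comparator, and the underlying Spark bug may well have a different cause:
{code}
import java.util.{Arrays, Comparator}

object BrokenComparatorDemo {
  // Illustrative only: TimSort throws "Comparison method violates its general
  // contract!" when the ordering is not a consistent total order. With NaN,
  // compare(NaN, 1.0) and compare(NaN, 2.0) both return 0 while
  // compare(1.0, 2.0) returns -1, which breaks the contract.
  val brokenOrdering: Comparator[java.lang.Double] = new Comparator[java.lang.Double] {
    override def compare(a: java.lang.Double, b: java.lang.Double): Int =
      if (a < b) -1 else if (a > b) 1 else 0
  }

  def main(args: Array[String]): Unit = {
    // Tiny arrays are fine; large NaN-laden arrays can hit the exception above.
    val xs: Array[java.lang.Double] = Array(1.0, Double.NaN, 2.0).map(Double.box)
    Arrays.sort(xs, brokenOrdering)
    println(xs.mkString(", "))
  }
}
{code}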

> TimSort Comparison method violates its general contract
> ---
>
> Key: SPARK-13850
> URL: https://issues.apache.org/jira/browse/SPARK-13850
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
>
> While running a query which does a group by on a large dataset, the query 
> fails with the following stack trace. 
> {code}
> Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most 
> recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, 
> hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison 
> method violates its general contract!
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Please note that the same query used to succeed in Spark 1.5 so it seems like 
> a regression in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, 

[jira] [Updated] (SPARK-14282) CodeFormatter should handle oneline comment with /* */

2016-03-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14282:
--
Description: 
This issue improves `CodeFormatter` to fix the following cases.

*Before*
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

*After*
{code}
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this PR fixes the following too. (Similar with SPARK-14185)
*Before*
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

*After*
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}



  was:
This issue improves `CodeFormatter` to fix the following cases.

* Before *
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

* After *
{code}
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this PR fixes the following too. (Similar with SPARK-14185)
* Before *
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

* After *
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}




> CodeFormatter should handle oneline comment with /* */
> --
>
> Key: SPARK-14282
> URL: https://issues.apache.org/jira/browse/SPARK-14282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue improves `CodeFormatter` to fix the following cases.
> *Before*
> {code}
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
> input[0, double], if 

[jira] [Created] (SPARK-14282) CodeFormatter should handle oneline comment with /* */

2016-03-30 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-14282:
-

 Summary: CodeFormatter should handle oneline comment with /* */
 Key: SPARK-14282
 URL: https://issues.apache.org/jira/browse/SPARK-14282
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dongjoon Hyun


This issue improves `CodeFormatter` to fix the following cases.

* Before *
{code}
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */   boolean isNull = false;
/* 023 */   final Object[] values = new Object[2];
/* 024 */   /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */ final int value9 = -1;
/* 056 */ isNull6 = true;
/* 057 */ value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */ 
{code}

* After *
{code}
/* 053 */ if (!false && false) {
/* 054 */   /* null */
/* 055 */   final int value9 = -1;
/* 056 */   isNull6 = true;
/* 057 */   value6 = value9;
/* 058 */ } else {
...
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */ InternalRow i = (InternalRow) _i;
/* 021 */ /* createexternalrow(if (isnull(input[0, double])) null else 
input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */ boolean isNull = false;
/* 023 */ final Object[] values = new Object[2];
/* 024 */ /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */ /* isnull(input[0, double]) */
...
/* 077 */ return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */ 
{code}

Also, this PR fixes the following too. (Similar with SPARK-14185)
* Before *
{code}
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object 
generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}

* After *
{code}
16/03/30 12:46:32 DEBUG WholeStageCodegen: 
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13955) Spark in yarn mode fails

2016-03-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13955.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.0.0

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the SPARK-11157 work; 
> creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14211) Remove ANTLR3 based parser

2016-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218844#comment-15218844
 ] 

Apache Spark commented on SPARK-14211:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12071

> Remove ANTLR3 based parser
> --
>
> Key: SPARK-14211
> URL: https://issues.apache.org/jira/browse/SPARK-14211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14211) Remove ANTLR3 based parser

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14211:


Assignee: (was: Apache Spark)

> Remove ANTLR3 based parser
> --
>
> Key: SPARK-14211
> URL: https://issues.apache.org/jira/browse/SPARK-14211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14211) Remove ANTLR3 based parser

2016-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14211:


Assignee: Apache Spark

> Remove ANTLR3 based parser
> --
>
> Key: SPARK-14211
> URL: https://issues.apache.org/jira/browse/SPARK-14211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-03-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218800#comment-15218800
 ] 

Ryan Blue commented on SPARK-13723:
---

+1

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified when 
> dynamic allocation is enabled. Currently if --num-executors is specified 
> dynamic allocation is disabled and it just uses a static number of executors.
> I would rather see the default behavior changed in the 2.x line. If dynamic 
> allocation config is on then num-executors goes to max and initial # of 
> executors. I think this would allow users to easily cap their usage and would 
> still allow it to free up executors. It would also allow users doing ML to start 
> out with a # of executors, and if they are actually caching the data the 
> executors wouldn't be freed up. So you would get very similar behavior to having 
> dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things with spark-shell. It 
> also has a big effect when people are doing MapReduce/ETL type workloads.
> The problem is that people are used to specifying num-executors, so if we turn 
> it on by default in a cluster config it's just overridden.
> We should also update the spark-submit --help description for --num-executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-03-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218792#comment-15218792
 ] 

Thomas Graves commented on SPARK-13723:
---

I'm saying that if either --num-executors or spark.executor.instances is 
specified and dynamic allocation is on, the number specified would really just 
go to the initial # of executors setting 
(spark.dynamicAllocation.initialExecutors), or if people prefer we could have it 
apply to maxExecutors. If a user really wants a static number of executors, 
they turn off dynamic allocation with spark.dynamicAllocation.enabled=false.

I agree with you that in general I don't like changing the behavior, but in this 
case I think these settings have confusing interactions and make it difficult 
to turn on dynamic allocation at a cluster level.

I still don't think this is perfect, but it makes it more explicit whether 
dynamic allocation is on or off.  
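
A hedged Scala sketch of the precedence being proposed (the names and the returned (initial, max) pair are illustrative assumptions, not Spark's actual config resolution code):
{code}
// Hedged sketch of the proposed precedence: with dynamic allocation on, an
// explicit executor count would seed the initial count (capped by the
// configured maximum) instead of forcing static allocation.
def resolveExecutorCounts(
    requestedExecutors: Option[Int],   // --num-executors / spark.executor.instances
    dynamicAllocationEnabled: Boolean,
    initialExecutors: Int,
    maxExecutors: Int): (Int, Int) = {
  requestedExecutors match {
    case Some(n) if dynamicAllocationEnabled => (n, math.max(n, maxExecutors))
    case Some(n) => (n, n)                          // today's static behavior
    case None => (initialExecutors, maxExecutors)   // plain dynamic allocation
  }
}
{code}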

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified when 
> dynamic allocation is enabled. Currently if --num-executors is specified 
> dynamic allocation is disabled and it just uses a static number of executors.
> I would rather see the default behavior changed in the 2.x line. If dynamic 
> allocation config is on then num-executors goes to max and initial # of 
> executors. I think this would allow users to easily cap their usage and would 
> still allow it to free up executors. It would also allow users doing ML to start 
> out with a # of executors, and if they are actually caching the data the 
> executors wouldn't be freed up. So you would get very similar behavior to having 
> dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things with spark-shell. It 
> also has a big effect when people are doing MapReduce/ETL type workloads.
> The problem is that people are used to specifying num-executors, so if we turn 
> it on by default in a cluster config it's just overridden.
> We should also update the spark-submit --help description for --num-executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14245) webUI should display the user

2016-03-30 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218742#comment-15218742
 ] 

Alex Bozarth commented on SPARK-14245:
--

I'm looking into this in my free moments today, will probably have something in 
the next day or two

> webUI should display the user
> -
>
> Key: SPARK-14245
> URL: https://issues.apache.org/jira/browse/SPARK-14245
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>
> It would be nice if the Spark UI (both active and history) showed the user 
> who ran the application somewhere when you are in the application view.   
> Perhaps under the Jobs view by total uptime and scheduler mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13782) Model export/import for spark.ml: BisectingKMeans

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13782:
--
Shepherd: Joseph K. Bradley  (was: Xiangrui Meng)

> Model export/import for spark.ml: BisectingKMeans
> -
>
> Key: SPARK-13782
> URL: https://issues.apache.org/jira/browse/SPARK-13782
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-03-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218701#comment-15218701
 ] 

Marcelo Vanzin commented on SPARK-13723:


I'm not a great fan of changing the behavior, but I understand the point.

To be clear: {{--num-executors}} is mostly a fancy alias for 
{{spark.executor.instances}}. Are you suggesting that you'd break that 
coupling, so that if dynamic allocation is on, it would map to something else? 
And if both {{spark.executor.instances}} and dynamic allocation are provided, 
something else would happen (potentially maintaining the current behavior)?

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified when 
> dynamic allocation is enabled. Currently if --num-executors is specified 
> dynamic allocation is disabled and it just uses a static number of executors.
> I would rather see the default behavior changed in the 2.x line. If dynamic 
> allocation config is on then num-executors goes to max and initial # of 
> executors. I think this would allow users to easily cap their usage and would 
> still allow it to free up executors. It would also allow users doing ML to start 
> out with a # of executors, and if they are actually caching the data the 
> executors wouldn't be freed up. So you would get very similar behavior to having 
> dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things with spark-shell. It 
> also has a big effect when people are doing MapReduce/ETL type workloads.
> The problem is that people are used to specifying num-executors, so if we turn 
> it on by default in a cluster config it's just overridden.
> We should also update the spark-submit --help description for --num-executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14264) Add feature importances for GBTs in Pyspark

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14264:
--
Component/s: PySpark
 ML

> Add feature importances for GBTs in Pyspark
> ---
>
> Key: SPARK-14264
> URL: https://issues.apache.org/jira/browse/SPARK-14264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> GBT feature importances are now implemented in scala. We should expose them 
> in the pyspark API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14264) Add feature importances for GBTs in Pyspark

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14264:
--
Shepherd: Joseph K. Bradley
Assignee: Seth Hendrickson
Target Version/s: 2.0.0

> Add feature importances for GBTs in Pyspark
> ---
>
> Key: SPARK-14264
> URL: https://issues.apache.org/jira/browse/SPARK-14264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> GBT feature importances are now implemented in scala. We should expose them 
> in the pyspark API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11892) Model export/import for spark.ml: OneVsRest

2016-03-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11892:
--
Shepherd: Joseph K. Bradley

> Model export/import for spark.ml: OneVsRest
> ---
>
> Key: SPARK-11892
> URL: https://issues.apache.org/jira/browse/SPARK-11892
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Implement read/write for OneVsRest estimator and its model.
> When this is implemented, {{CrossValidatorReader.getUidMap}} should be 
> updated as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14230) Config the start time (jitter) for streaming jobs

2016-03-30 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218672#comment-15218672
 ] 

Liyin Tang commented on SPARK-14230:


[~davies], thanks for the response. If I understand it correctly, the PR you 
pointed to sets a start time for the window function. What about non-window 
functions, do we also have a way to specify the start time?



> Config the start time (jitter) for streaming jobs
> -
>
> Key: SPARK-14230
> URL: https://issues.apache.org/jira/browse/SPARK-14230
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Liyin Tang
>
> Currently, RecurringTimer will normalize the start time. For instance, if 
> the batch duration is 1 min, all the jobs will start exactly at the 1 min boundary. 
> This actually adds some burden to the streaming source. Assuming the source 
> is Kafka, and there is a list of streaming jobs with a 1 min batch duration, 
> then in the first few seconds of each minute, high network traffic will be observed 
> in Kafka. This makes Kafka capacity planning tricky. 
> It would be great to have an option in the streaming context to set the job 
> start time. In this way, users can add a jitter to the start time of each job 
> and make Kafka fetch_request traffic much smoother across the duration window.
> {code}
> class RecurringTimer {
>   def getStartTime(): Long = {
> (math.floor(clock.currentTime.toDouble / period) + 1).toLong * period + 
> jitter
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-30 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218659#comment-15218659
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

That's the (hopefully) beautiful thing about pex.

{noformat}
$ pex numpy pandas -o c_requirements.pex
$ ./c_requirements.pex
Python 2.7.10 (default, Aug  4 2015, 19:54:05)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import numpy
>>> import pandas
>>> print numpy.__version__
1.11.0
>>> print pandas.__version__
0.18.0
>>>
{noformat}

The catch of course is that the pex file itself would have to be built on a 
node with the same arch as Spark workers (i.e. can't build pex on Mac OS and 
ship to Linux cluster unless all dependencies are pure python). To build a 
platform agnostic env, we'd have to look at conda.

I know there was an effort to support pex with pyspark 
https://github.com/URXtech/spex but it hasn't seen much activity recently. I 
tried reaching out to the author but got no response.

I could take a shot at adding support for this unless @zjffdu already has plans.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third party python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations. One is native 
> virtualenv; another is through conda. This jira is trying to bring these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14230) Config the start time (jitter) for streaming jobs

2016-03-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218657#comment-15218657
 ] 

Davies Liu commented on SPARK-14230:


This will be supported in structured streaming: see 
https://github.com/apache/spark/pull/12008/files#diff-80a6da9ac9681594543c70c837b12641R2597
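
For anyone following the linked diff, a hedged Scala sketch of what that option looks like on the user side; the fourth window() argument is the start-time offset playing the role of the jitter discussed here (the column and DataFrame names are assumptions):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Hedged sketch: `events` is an assumed streaming DataFrame with a "timestamp"
// column. The last argument to window() offsets the window boundaries (here by
// 7 seconds), so different queries need not all fire on the exact minute mark.
def countsWithOffset(events: DataFrame): DataFrame = {
  events
    .groupBy(window(col("timestamp"), "1 minute", "1 minute", "7 seconds"))
    .count()
}
{code}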

> Config the start time (jitter) for streaming jobs
> -
>
> Key: SPARK-14230
> URL: https://issues.apache.org/jira/browse/SPARK-14230
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Liyin Tang
>
> Currently, RecurringTimer will normalize the start time. For instance, if 
> the batch duration is 1 min, all the jobs will start exactly at the 1 min boundary. 
> This actually adds some burden to the streaming source. Assuming the source 
> is Kafka, and there is a list of streaming jobs with a 1 min batch duration, 
> then in the first few seconds of each minute, high network traffic will be observed 
> in Kafka. This makes Kafka capacity planning tricky. 
> It would be great to have an option in the streaming context to set the job 
> start time. In this way, users can add a jitter to the start time of each job 
> and make Kafka fetch_request traffic much smoother across the duration window.
> {code}
> class RecurringTimer {
>   def getStartTime(): Long = {
> (math.floor(clock.currentTime.toDouble / period) + 1).toLong * period + 
> jitter
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14281) Fix the java8-tests profile and run those tests in Jenkins

2016-03-30 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-14281:
--

 Summary: Fix the java8-tests profile and run those tests in Jenkins
 Key: SPARK-14281
 URL: https://issues.apache.org/jira/browse/SPARK-14281
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, Tests
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark has some tests for compilation of Java 8 sources (using lambdas) guarded 
behind a {{java8-tests}} maven profile, but we currently do not build or run 
those tests. As a result, the tests no longer compile.

We should fix these tests and set up automated CI so that they don't break 
again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-03-30 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218611#comment-15218611
 ] 

Josh Rosen commented on SPARK-14279:


This is probably pretty easy to do if you use a generate-sources compilation 
phase to take the version information and stick it into a version.java file or 
something similar. Feel free to submit a PR and I'll review.
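
A hedged Scala sketch of the runtime half of that idea, reading build metadata back from a properties resource generated during the build (the resource name and keys are assumptions, not an existing Spark file):
{code}
import java.util.Properties

// Hedged sketch of the consuming side only: the build would write pom-derived
// values (version, branch, revision, build date) into a classpath resource at
// compile time, and spark-submit --version would read them back at runtime.
object BuildInfo {
  def load(): Properties = {
    val props = new Properties()
    val in = getClass.getResourceAsStream("/spark-version-info.properties")
    if (in != null) {
      try props.load(in) finally in.close()
    }
    props
  }
}
{code}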

> Improve the spark build to pick the version information from the pom file 
> instead of package.scala
> --
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Story
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from pom.version, probably stored inside a properties file. Also, 
> it might be nice to include other details like branch and build information 
> in the spark-submit --version output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-30 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218610#comment-15218610
 ] 

Juliet Hougland commented on SPARK-13587:
-

Being able to ship around pex files like we do .py and .egg files sounds very 
reasonable from a delineation of responsibilities perspective.

I like the idea and would support a change like that. A question/edge case 
worth working out is how pex files relate to compiled C libs that Python libs 
may need to link to. I don't know much about pex, but my initial assessment is 
that it shouldn't be a huge problem. I like this solution.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third party python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations. One is native 
> virtualenv; another is through conda. This jira is trying to bring these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-03-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14279:
---
Component/s: Build

> Improve the spark build to pick the version information from the pom file 
> instead of package.scala
> --
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Story
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-03-30 Thread Sanket Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-14279:
-
Description: Right now spark-submit --version and other parts of the code 
pick up version information from a static SPARK_VERSION. We would want to pick 
the version from pom.version, probably stored inside a properties file. Also, 
it might be nice to include other details like branch and build information in 
the spark-submit --version output.

> Improve the spark build to pick the version information from the pom file 
> instead of package.scala
> --
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Story
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from pom.version, probably stored inside a properties file. Also, 
> it might be nice to include other details like branch and build information 
> in the spark-submit --version output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


