[jira] [Commented] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217386#comment-15217386 ] Apache Spark commented on SPARK-14264: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/12056 > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature >Reporter: Seth Hendrickson >Priority: Minor > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
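For context on what the PR exposes: tree-ensemble feature importances of this kind are typically computed by summing each feature's impurity gain over a tree's splits, normalizing per tree, then averaging across the ensemble and renormalizing. A minimal plain-Python sketch of that aggregation (hypothetical helper names and data layout; not Spark's actual implementation):

```python
def tree_importances(splits, num_features):
    """splits: list of (feature_index, impurity_gain, node_weight) for one tree."""
    imp = [0.0] * num_features
    for feat, gain, weight in splits:
        imp[feat] += gain * weight
    total = sum(imp)
    # Normalize so each tree contributes equally to the ensemble average.
    return [v / total if total > 0 else 0.0 for v in imp]

def ensemble_importances(trees, num_features):
    """Average the per-tree normalized importances, then renormalize."""
    agg = [0.0] * num_features
    for splits in trees:
        for i, v in enumerate(tree_importances(splits, num_features)):
            agg[i] += v
    total = sum(agg)
    return [v / total for v in agg]
```

In PySpark itself this surfaces as a property on the fitted model mirroring the Scala API, rather than a standalone function like the sketch above.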
[jira] [Assigned] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14264: Assignee: Apache Spark > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14264: Assignee: (was: Apache Spark) > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature >Reporter: Seth Hendrickson >Priority: Minor > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14264) Add feature importances for GBTs in Pyspark
Seth Hendrickson created SPARK-14264: Summary: Add feature importances for GBTs in Pyspark Key: SPARK-14264 URL: https://issues.apache.org/jira/browse/SPARK-14264 Project: Spark Issue Type: New Feature Reporter: Seth Hendrickson Priority: Minor GBT feature importances are now implemented in scala. We should expose them in the pyspark API.
[jira] [Assigned] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14263: Assignee: Apache Spark > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14263: Assignee: (was: Apache Spark) > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217375#comment-15217375 ] Apache Spark commented on SPARK-14263: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12055 > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
Sameer Agarwal created SPARK-14263: -- Summary: Benchmark Vectorized HashMap for GroupBy Aggregates Key: SPARK-14263 URL: https://issues.apache.org/jira/browse/SPARK-14263 Project: Spark Issue Type: New Feature Components: SQL Reporter: Sameer Agarwal
[jira] [Commented] (SPARK-13112) CoarsedExecutorBackend register to driver should wait Executor was ready
[ https://issues.apache.org/jira/browse/SPARK-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217362#comment-15217362 ] meiyoula commented on SPARK-13112: -- I agree with it. The problem occurs when CoarseGrainedExecutorBackend receives RegisterExecutorResponse only after LaunchTask. I don't think we can guarantee that CoarseGrainedExecutorBackend receives RegisterExecutorResponse before LaunchTask: the backend may be busy while sending RegisterExecutorResponse to itself, and so process the LaunchTask message first.
> CoarsedExecutorBackend register to driver should wait Executor was ready
> 
> Key: SPARK-13112
> URL: https://issues.apache.org/jira/browse/SPARK-13112
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: SuYan
>
> desc:
> Because some hosts' disks are busy, the executor fails with a TimeoutException while trying to register with the shuffle server on that host, and then exits(1) when a task is launched on a null executor.
> Since the Yarn cluster's resources are a little busy, Yarn thinks that host is idle and prefers to allocate an executor on the same host again, so there is a chance that one task fails 4 times on the same host.
> Currently, CoarsedExecutorBackend registers with the driver first, and only initializes the Executor after registration succeeds.
> If an exception occurs during Executor initialization, the driver does not know about it and will still launch tasks on that executor, which then calls System.exit(1).
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>   ..
>   case LaunchTask(data) =>
>     if (executor == null) {
>       logError("Received LaunchTask command but executor was null")
>       System.exit(1)
> {code}
> It is more reasonable to register with the driver after the Executor is ready, and to make registerTimeout configurable...
[jira] [Commented] (SPARK-13060) CoarsedExecutorBackend register to driver should wait Executor was ready?
[ https://issues.apache.org/jira/browse/SPARK-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217359#comment-15217359 ] meiyoula commented on SPARK-13060: -- This problem occurs when CoarseGrainedExecutorBackend receives RegisterExecutorResponse only after LaunchTask. I met it in an actual Spark cluster.
> CoarsedExecutorBackend register to driver should wait Executor was ready?
> -
> 
> Key: SPARK-13060
> URL: https://issues.apache.org/jira/browse/SPARK-13060
> Project: Spark
> Issue Type: Bug
> Reporter: SuYan
> Priority: Trivial
>
> Hi Josh
> I am a Spark user, and I am currently confused about executor registration.
> Why does CoarsedExecutorBackend register with the driver first, and only initialize the Executor after registration succeeds?
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
> It is also strange that it exits(1) when launchTask finds executor == null (due to errors during Executor initialization).
> {code}
> case LaunchTask(data) =>
>   if (executor == null) {
>     logError("Received LaunchTask command but executor was null")
>     System.exit(1)
> {code}
> What concerns led to registering with the driver first? Why not register with the driver only once the Executor is fully ready?
> like
> {code}
> val (hostname, _) = Utils.extractHostPortFromSparkUrl(hostPort)
> val executor: Executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
[jira] [Issue Comment Deleted] (SPARK-13060) CoarsedExecutorBackend register to driver should wait Executor was ready?
[ https://issues.apache.org/jira/browse/SPARK-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-13060: - Comment: was deleted (was: When CoarseGrainedExecutorBackend receives RegisterExecutorResponse slow after LaunchTask, it will occurs this problem. I met it in an actual spark cluster. ) > CoarsedExecutorBackend register to driver should wait Executor was ready? > - > > Key: SPARK-13060 > URL: https://issues.apache.org/jira/browse/SPARK-13060 > Project: Spark > Issue Type: Bug >Reporter: SuYan >Priority: Trivial > > Hi Josh > I am spark user, currently I feel confused about executor registration. > {code} > Why CoarsedExecutorBackend register to driver first, and after registerDriver > successful, then initial Executor. > override def receive: PartialFunction[Any, Unit] = { > case RegisteredExecutor(hostname) => > logInfo("Successfully registered with driver") > executor = new Executor(executorId, hostname, env, userClassPath, isLocal > = false) > > {code} > and it is strange that exit(1) while launchTask with executor=null(due to > some errors occurs in Executor initialization). > {code} > case LaunchTask(data) => > if (executor == null) { > logError("Received LaunchTask command but executor was null") > System.exit(1) > {code} > What concerns was considered to make registerDriver first, Why not register > to driver until Executor sth. have been all ready? > like > {code} > val (hostname, _) = Utils.extractHostPortFromSparkUrl(hostPort) > val executor: Executor = new Executor(executorId, hostname, env, > userClassPath, isLocal = false) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13060) CoarsedExecutorBackend register to driver should wait Executor was ready?
[ https://issues.apache.org/jira/browse/SPARK-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217360#comment-15217360 ] meiyoula commented on SPARK-13060: -- This problem occurs when CoarseGrainedExecutorBackend receives RegisterExecutorResponse only after LaunchTask. I met it in an actual Spark cluster.
> CoarsedExecutorBackend register to driver should wait Executor was ready?
> -
> 
> Key: SPARK-13060
> URL: https://issues.apache.org/jira/browse/SPARK-13060
> Project: Spark
> Issue Type: Bug
> Reporter: SuYan
> Priority: Trivial
>
> Hi Josh
> I am a Spark user, and I am currently confused about executor registration.
> Why does CoarsedExecutorBackend register with the driver first, and only initialize the Executor after registration succeeds?
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
> It is also strange that it exits(1) when launchTask finds executor == null (due to errors during Executor initialization).
> {code}
> case LaunchTask(data) =>
>   if (executor == null) {
>     logError("Received LaunchTask command but executor was null")
>     System.exit(1)
> {code}
> What concerns led to registering with the driver first? Why not register with the driver only once the Executor is fully ready?
> like
> {code}
> val (hostname, _) = Utils.extractHostPortFromSparkUrl(hostPort)
> val executor: Executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
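The message-ordering race discussed in these two issues can be modeled in a few lines. This is a toy plain-Python sketch (not Spark code): the backend only creates its Executor when the RegisteredExecutor reply is processed, so a LaunchTask that arrives first hits a null executor, while constructing the Executor before registering removes that window.

```python
class Backend:
    """Toy model of CoarseGrainedExecutorBackend's message handling."""

    def __init__(self, eager_init=False):
        # Proposed ordering: initialize the Executor before registering,
        # so it can never be observed as None by a task launch.
        self.executor = "executor" if eager_init else None
        self.exited = False

    def on_registered_executor(self):
        # Current ordering: the Executor is only created when the
        # RegisteredExecutor reply is processed.
        self.executor = "executor"

    def on_launch_task(self):
        if self.executor is None:
            self.exited = True  # stands in for System.exit(1)
```

With the current ordering, `Backend().on_launch_task()` before `on_registered_executor()` trips the fatal exit; with `eager_init=True` the same message order is harmless.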
[jira] [Resolved] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14254. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.0.0
> Add logs to help investigate the network performance
> 
> Key: SPARK-14254
> URL: https://issues.apache.org/jira/browse/SPARK-14254
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
> It would be very helpful for network performance investigation if we logged the time spent on connecting and resolving hosts.
[jira] [Assigned] (SPARK-14262) Correct app state after master leader changed
[ https://issues.apache.org/jira/browse/SPARK-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14262: Assignee: Apache Spark
> Correct app state after master leader changed
> -
> 
> Key: SPARK-14262
> URL: https://issues.apache.org/jira/browse/SPARK-14262
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.1
> Reporter: ZhengYaofeng
> Assignee: Apache Spark
> Priority: Minor
>
> Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper.
> 1) Start the Spark cluster and submit several apps.
> 2) After these apps' states change to RUNNING, shut down one master.
> 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Assigned] (SPARK-14262) Correct app state after master leader changed
[ https://issues.apache.org/jira/browse/SPARK-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14262: Assignee: (was: Apache Spark)
> Correct app state after master leader changed
> -
> 
> Key: SPARK-14262
> URL: https://issues.apache.org/jira/browse/SPARK-14262
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.1
> Reporter: ZhengYaofeng
> Priority: Minor
>
> Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper.
> 1) Start the Spark cluster and submit several apps.
> 2) After these apps' states change to RUNNING, shut down one master.
> 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Commented] (SPARK-14262) Correct app state after master leader changed
[ https://issues.apache.org/jira/browse/SPARK-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217348#comment-15217348 ] Apache Spark commented on SPARK-14262: -- User 'GavinGavinNo1' has created a pull request for this issue: https://github.com/apache/spark/pull/12054
> Correct app state after master leader changed
> -
> 
> Key: SPARK-14262
> URL: https://issues.apache.org/jira/browse/SPARK-14262
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.1
> Reporter: ZhengYaofeng
> Priority: Minor
>
> Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper.
> 1) Start the Spark cluster and submit several apps.
> 2) After these apps' states change to RUNNING, shut down one master.
> 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Created] (SPARK-14262) Correct app state after master leader changed
ZhengYaofeng created SPARK-14262: Summary: Correct app state after master leader changed Key: SPARK-14262 URL: https://issues.apache.org/jira/browse/SPARK-14262 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1, 1.5.2 Reporter: ZhengYaofeng Priority: Minor Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper. 1) Start the Spark cluster and submit several apps. 2) After these apps' states change to RUNNING, shut down one master. 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217324#comment-15217324 ] Stefan Krawczyk commented on SPARK-14033: - {quote}It actually makes it significantly easier to maintain because it eliminates a lot of duplicated code. This duplicated functionality between Estimator & Model (in setters, schema validation, etc.) leads to inconsistencies and bugs; I actually found bugs in StringIndexer and RFormula while prototyping the merge of StringIndexer & Model.{quote} That smells like a different abstraction issue to me. But sure, I can see where you're coming from. One argument for keeping them separate is that, from a dependency standpoint, it would be advantageous to separate training code (estimators) from prediction code (models). That way you could package your trained models and use them without having to bring in all the training dependencies.
> Merging Estimator & Model
> -
> 
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as scikit-learn).
> For details, please see the linked design doc. Comment on this JIRA to give feedback on the proposed design. Once the proposal is discussed and this work is confirmed as ready to proceed, this JIRA will serve as an umbrella for the merge tasks.
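The dependency argument can be made concrete with a small sketch (plain Python with hypothetical class names, mimicking the estimator/model split that spark.ml currently has): the estimator carries the training logic, while the model object it returns holds only the state needed at prediction time and could in principle be shipped without the training code.

```python
class MeanImputer:
    """Estimator: owns training-time logic (and would own its dependencies)."""

    def fit(self, column):
        # "Training" here is just computing the column mean.
        mean = sum(column) / len(column)
        return MeanImputerModel(mean)


class MeanImputerModel:
    """Model: carries only prediction-time state, no training code."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, column):
        # Replace missing values with the learned mean.
        return [self.mean if v is None else v for v in column]
```

Merging the two classes would put `fit` and `transform` on one object; keeping them separate lets the model travel without the estimator.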
[jira] [Closed] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende closed SPARK-13703. --- Resolution: Not A Problem > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor >
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochun Liang updated SPARK-14261: --- Attachment: MemorySnapshot.PNG This is the memory snapshot after queries have been running for over 11 hours.
> Memory leak in Spark Thrift Server
> --
> 
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Xiaochun Liang
> Attachments: MemorySnapshot.PNG
>
> I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in. I suspect there is a memory leak in the Spark Thrift server.
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochun Liang updated SPARK-14261: --- Description: I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in. I suspect there is a memory leak in the Spark Thrift server. (was: I am running Spark Thrift server on Windows Server 2012. The Spark Thrift server is launched as Yarn client mode. Its memory usage is increased gradually with the queries in. )
> Memory leak in Spark Thrift Server
> --
> 
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Xiaochun Liang
>
> I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in. I suspect there is a memory leak in the Spark Thrift server.
[jira] [Created] (SPARK-14261) Memory leak in Spark Thrift Server
Xiaochun Liang created SPARK-14261: -- Summary: Memory leak in Spark Thrift Server Key: SPARK-14261 URL: https://issues.apache.org/jira/browse/SPARK-14261 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Xiaochun Liang I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in.
[jira] [Commented] (SPARK-14260) Increase default value for maxCharsPerColumn
[ https://issues.apache.org/jira/browse/SPARK-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217262#comment-15217262 ] Hyukjin Kwon commented on SPARK-14260: -- I am currently not sure whether this value affects performance. I should investigate, and will close this issue if it does matter.
> Increase default value for maxCharsPerColumn
> 
> Key: SPARK-14260
> URL: https://issues.apache.org/jira/browse/SPARK-14260
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Hyukjin Kwon
> Priority: Trivial
>
> I guess the default value of the option {{maxCharsPerColumn}} is relatively small: 1,000,000 characters, meaning 976KB.
> It looks like some users have hit this and ended up setting the value manually:
> https://github.com/databricks/spark-csv/issues/295
> https://issues.apache.org/jira/browse/SPARK-14103
> According to the [univocity API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)], this limit exists to avoid {{OutOfMemoryErrors}}.
> If this does not harm performance, I think it would be better to make the default value much bigger (e.g. 10MB or 100MB) so that users do not have to worry about the length of each field in a CSV file.
> Apparently the Apache CSV parser does not have such limits.
[jira] [Created] (SPARK-14260) Increase default value for maxCharsPerColumn
Hyukjin Kwon created SPARK-14260: Summary: Increase default value for maxCharsPerColumn Key: SPARK-14260 URL: https://issues.apache.org/jira/browse/SPARK-14260 Project: Spark Issue Type: Sub-task Reporter: Hyukjin Kwon Priority: Trivial I guess the default value of the option {{maxCharsPerColumn}} is relatively small: 1,000,000 characters, meaning 976KB. It looks like some users have hit this and ended up setting the value manually: https://github.com/databricks/spark-csv/issues/295 https://issues.apache.org/jira/browse/SPARK-14103 According to the [univocity API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)], this limit exists to avoid {{OutOfMemoryErrors}}. If this does not harm performance, I think it would be better to make the default value much bigger (e.g. 10MB or 100MB) so that users do not have to worry about the length of each field in a CSV file. Apparently the Apache CSV parser does not have such limits.
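To make the trade-off concrete, here is a hedged plain-Python sketch (an illustration, not the univocity parser itself) of the kind of per-field character cap {{maxCharsPerColumn}} imposes: the cap bounds how much memory one malformed field, e.g. one produced by a wrong delimiter or line separator, can consume before the parser gives up.

```python
def parse_field(chars, max_chars_per_column=1_000_000):
    """Accumulate one field's characters, refusing to grow past the cap."""
    buf = []
    for ch in chars:
        if len(buf) >= max_chars_per_column:
            # Mirrors the intent of univocity's error: an oversized field
            # usually means the delimiter or line separator is wrong.
            raise ValueError(
                "Length of parsed input exceeds maxCharsPerColumn "
                f"({max_chars_per_column}); check the delimiter/line separator")
        buf.append(ch)
    return "".join(buf)
```

Raising the default cap trades this early-failure protection for convenience on files with genuinely long fields.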
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217235#comment-15217235 ] zhengruifeng commented on SPARK-14174: -- Faster KMeans is needed in many cases: high dimensionality (> 1,000,000), big K (> 1000), and large datasets (unlabeled data is much easier to obtain than labeled data, and the data size is usually much larger than in supervised learning tasks). In practice, I encountered a time-consuming case: I needed to cluster about one billion points with 300,000 dimensions into 1000 centers. Using MLlib's KMeans on a cluster of 16 servers, it took about one week to finish. I will compare it with BisectingKMeans on speed and cost.
> Accelerate KMeans via Mini-Batch EM
> ---
> 
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch KMeans in MLlib, and the acceleration is really significant.
> MiniBatch KMeans is named XMeans in the following lines.
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> All three trainings above reached the maximum number of iterations (10).
> We can see that the WSSSEs are almost the same, while their speed differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
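For readers unfamiliar with the technique, the per-center running-mean update at the heart of mini-batch k-means fits in a few lines of NumPy. This is an illustrative single-machine sketch with a naive "first k points" initialization, not the proposed XMeans implementation (which would use k-means|| initialization and Spark RDDs):

```python
import numpy as np

def mini_batch_kmeans(X, k, fraction=0.1, iters=10, seed=123):
    """Update centers from a random sample each iteration instead of the full data."""
    rng = np.random.default_rng(seed)
    centers = X[:k].astype(float)  # naive init for the sketch
    counts = np.zeros(k)
    batch = max(1, int(fraction * len(X)))
    for _ in range(iters):
        sample = X[rng.choice(len(X), batch, replace=False)]
        # Assign each sampled point to its nearest center.
        dists = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for i, x in zip(labels, sample):
            counts[i] += 1
            # Per-center step size 1/count makes each center the running
            # mean of all points ever assigned to it.
            centers[i] += (x - centers[i]) / counts[i]
    return centers
```

The `fraction` parameter plays the role of `miniBatchFraction` in the quoted benchmark: each iteration touches only `fraction * n` points, which is where the speedup comes from.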
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217212#comment-15217212 ] Hyukjin Kwon commented on SPARK-14103: -- For long messages, there is a JIRA opened already here, SPARK-13792. > Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. 
Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . > web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; > aborting job > ^M[Stage 1:> (0 + 1) > / 2] > {code} > For a
[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217198#comment-15217198 ] Hyukjin Kwon edited comment on SPARK-14103 at 3/30/16 1:30 AM: --- As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the codes below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throw an exception, right? Could you maybe share the error message? I (or my blind eyes) cannot find the exception message about this. was (Author: hyukjin.kwon): As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the codes below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throws an exception, right? Could you maybe share the error message? I (or may blind eyes) cannot find the exception message about this. 
> Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . 
> web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at >
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217198#comment-15217198 ] Hyukjin Kwon commented on SPARK-14103: -- As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the code below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throws an exception, right? Could you maybe share the error message? I (or my blind eyes) cannot find the exception message about this. > Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. 
The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . > web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120) > at > 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at
[jira] [Issue Comment Deleted] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14103: - Comment: was deleted (was: As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the codes below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throws an exception, right? Could you maybe share the error message? I (or may blind eyes) cannot find the exception message about this.) > Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. 
The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . > web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120) > at > 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at
[jira] [Commented] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
[ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217194#comment-15217194 ] Hyukjin Kwon commented on SPARK-14194: -- Yes. CSV data source internally uses {{TextInputFormat}} and it will recognise each line by CRLF by default. I just wonder if standard CSV format allows newlines within records. > spark csv reader not working properly if CSV content contains CRLF character > (newline) in the intermediate cell > --- > > Key: SPARK-14194 > URL: https://issues.apache.org/jira/browse/SPARK-14194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Kumaresh C R > > We have CSV content like below, > Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r > "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), > Municapality,","USA", "1234567" > Since there is a '\n\r' character in the middle of the row (in the Address > column, to be exact), when we execute the Spark code below, it creates the > dataframe with two rows (excluding the header row), which is wrong. Since we > have specified the quote (") character, why does it treat the character in > the middle as a newline? This creates an issue when processing the resulting > dataframe. > DataFrame df = > sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", delim) > .option("quote", quote) > .option("escape", escape) > .load(sourceFile); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
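On the question of whether standard CSV allows newlines within records: RFC 4180 does permit CRLF inside a quoted field, and a quote-aware parser such as Python's stdlib {{csv}} module handles it, which illustrates why a purely line-oriented reader like {{TextInputFormat}} splits such records. A small illustration, independent of Spark:

```python
import csv
import io

# A record whose Address field contains an embedded CRLF,
# as in the report above (simplified to two columns).
raw = 'Sl.NO,Address\r\n"1","XZ Street\r\nMunicipality"\r\n'

# A quote-aware CSV parser keeps the CRLF inside the quoted
# field, so it sees a header row plus exactly one data record.
rows = list(csv.reader(io.StringIO(raw)))

# A naive line-oriented split, by contrast, breaks the record
# in two -- the behaviour described in the issue.
lines = raw.splitlines()
```

Here `rows` has two entries (header and one record) while `lines` has three, which is exactly the mismatch the reporter observed.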
[jira] [Assigned] (SPARK-14258) change scope of some functions in KafkaCluster
[ https://issues.apache.org/jira/browse/SPARK-14258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14258: Assignee: Apache Spark > change scope of some functions in KafkaCluster > -- > > Key: SPARK-14258 > URL: https://issues.apache.org/jira/browse/SPARK-14258 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tao Wang >Assignee: Apache Spark >Priority: Minor > Labels: kafka > > There are some functions in KafkaCluster whose scope is broader than > necessary. I found them while reviewing the code; it would be nice to narrow > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14258) change scope of some functions in KafkaCluster
[ https://issues.apache.org/jira/browse/SPARK-14258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217192#comment-15217192 ] Apache Spark commented on SPARK-14258: -- User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/12053 > change scope of some functions in KafkaCluster > -- > > Key: SPARK-14258 > URL: https://issues.apache.org/jira/browse/SPARK-14258 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tao Wang >Priority: Minor > Labels: kafka > > There are some functions in KafkaCluster whose scope is broader than > necessary. I found them while reviewing the code; it would be nice to narrow > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14258) change scope of some functions in KafkaCluster
[ https://issues.apache.org/jira/browse/SPARK-14258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14258: Assignee: (was: Apache Spark) > change scope of some functions in KafkaCluster > -- > > Key: SPARK-14258 > URL: https://issues.apache.org/jira/browse/SPARK-14258 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tao Wang >Priority: Minor > Labels: kafka > > There are some functions in KafkaCluster whose scope is broader than necessary. > I noticed this while reviewing the code; it would be good to narrow them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14258) change scope of some functions in KafkaCluster
Tao Wang created SPARK-14258: Summary: change scope of some functions in KafkaCluster Key: SPARK-14258 URL: https://issues.apache.org/jira/browse/SPARK-14258 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tao Wang Priority: Minor There are some functions in KafkaCluster whose scope is broader than necessary. I noticed this while reviewing the code; it would be good to narrow them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14259) Add config to control maximum number of files when coalescing partitions
Nong Li created SPARK-14259: --- Summary: Add config to control maximum number of files when coalescing partitions Key: SPARK-14259 URL: https://issues.apache.org/jira/browse/SPARK-14259 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nong Li Priority: Minor The FileSourceStrategy currently has a config to control the maximum byte size of coalesced partitions. It is helpful to also have a config to control the maximum number of files, as even small files have a non-trivial fixed cost. The current packing can put a lot of small files together, which causes straggler tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
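The packing idea in SPARK-14259 can be sketched in a few lines. This is an illustrative model, not Spark's actual FileSourceStrategy code, and the `max_files` cap is the hypothetical config the issue proposes: without it, a byte-size cap alone lets hundreds of tiny files land in one partition, where their fixed per-file open cost creates a straggler.

```python
# Greedy packing of file sizes into coalesced partitions, limited by both a
# total-byte cap and a file-count cap (the latter is the proposed addition).
def pack_files(sizes, max_bytes, max_files):
    partitions, current, current_bytes = [], [], 0
    for size in sizes:
        # start a new partition when either cap would be exceeded
        if current and (current_bytes + size > max_bytes
                        or len(current) >= max_files):
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions

# 100 one-byte files: the byte cap alone packs them all into one partition,
# while a count cap of 10 spreads them over 10 partitions of 10 files each.
print(len(pack_files([1] * 100, max_bytes=1000, max_files=10)))  # 10
```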
[jira] [Assigned] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-14127: - Assignee: Andrew Or > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13703: Assignee: (was: Apache Spark) > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13703: Assignee: Apache Spark > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende reopened SPARK-13703: - Now that it seems we are dropping Scala 2.10 for Spark 2.0, we can clean up some Scala 2.10-specific files. > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217170#comment-15217170 ] Miles Crawford commented on SPARK-14209: I don't think there's anything custom about our logging - we have it set to INFO, and a few specific modules are turned down, but not the ones you mention. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217150#comment-15217150 ] Marcelo Vanzin commented on SPARK-14209: Are you using a custom log configuration? I can't find some log messages that I can see in the code, especially from BlockManagerMasterEndpoint and BlockManagerMaster. BTW I think the real culprit is executor 51 (container 000191 instead of the one you originally mention). Still isolating all the logs related to that executor to try to figure out what's going on. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. 
> It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14125) View related commands
[ https://issues.apache.org/jira/browse/SPARK-14125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-14125: -- Comment: was deleted (was: User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12014) > View related commands > - > > Key: SPARK-14125 > URL: https://issues.apache.org/jira/browse/SPARK-14125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > We should support the following commands. > TOK_ALTERVIEW_AS > TOK_ALTERVIEW_PROPERTIES > TOK_ALTERVIEW_RENAME > TOK_DROPVIEW > TOK_DROPVIEW_PROPERTIES > TOK_DESCTABLE > For TOK_ALTERVIEW_ADDPARTS/TOK_ALTERVIEW_DROPPARTS, we should throw > exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14125) View related commands
[ https://issues.apache.org/jira/browse/SPARK-14125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-14125: -- Assignee: Xiao Li > View related commands > - > > Key: SPARK-14125 > URL: https://issues.apache.org/jira/browse/SPARK-14125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > > We should support the following commands. > TOK_ALTERVIEW_AS > TOK_ALTERVIEW_PROPERTIES > TOK_ALTERVIEW_RENAME > TOK_DROPVIEW > TOK_DROPVIEW_PROPERTIES > TOK_DESCTABLE > For TOK_ALTERVIEW_ADDPARTS/TOK_ALTERVIEW_DROPPARTS, we should throw > exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14124) Database related commands
[ https://issues.apache.org/jira/browse/SPARK-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14124. --- Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.0.0 > Database related commands > - > > Key: SPARK-14124 > URL: https://issues.apache.org/jira/browse/SPARK-14124 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > Fix For: 2.0.0 > > > We should support the following commands. > TOK_CREATEDATABASE > TOK_DESCDATABASE > TOK_DROPDATABASE > TOK_ALTERDATABASE_PROPERTIES > For, TOK_ALTERDATABASE_OWNER, let's throw an exception for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13796) Lock release errors occur frequently in executor logs
[ https://issues.apache.org/jira/browse/SPARK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217100#comment-15217100 ] Apache Spark commented on SPARK-13796: -- User 'nishkamravi2' has created a pull request for this issue: https://github.com/apache/spark/pull/12052 > Lock release errors occur frequently in executor logs > - > > Key: SPARK-13796 > URL: https://issues.apache.org/jira/browse/SPARK-13796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nishkam Ravi >Priority: Minor > > Executor logs contain a lot of these error messages (irrespective of the > workload): > 16/03/08 17:53:07 ERROR executor.Executor: 1 block locks were not released by > TID = 1119 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217089#comment-15217089 ] Miles Crawford commented on SPARK-14209: Here you go, from the same application run: https://www.dropbox.com/s/0m5c5rshydhj4am/driver-log.gz?dl=0 > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217085#comment-15217085 ] Miles Crawford commented on SPARK-14209: Here you go, from the same application run: https://www.dropbox.com/s/aw7jw1q43lgvxu9/driver-logs.gz?dl=0 > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Crawford updated SPARK-14209: --- Comment: was deleted (was: Here you go, from the same application run: https://www.dropbox.com/s/aw7jw1q43lgvxu9/driver-logs.gz?dl=0) > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14225) Cap the length of toCommentSafeString at 128 chars
[ https://issues.apache.org/jira/browse/SPARK-14225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14225. - Resolution: Fixed Assignee: Sameer Agarwal (was: Reynold Xin) Fix Version/s: 2.0.0 > Cap the length of toCommentSafeString at 128 chars > -- > > Key: SPARK-14225 > URL: https://issues.apache.org/jira/browse/SPARK-14225 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > I just saw a query with over 1000 fields and the generated expression comment > is pages long. At that level printing them no longer gives us much benefit. > We should just cap the length at some level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
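The fix in SPARK-14225 amounts to truncating generated-code comments at a fixed length. A minimal Python sketch of the idea (Spark's actual `toCommentSafeString` is Scala, and the exact escaping and truncation marker here are assumptions, not its real implementation):

```python
MAX_COMMENT_LEN = 128  # cap proposed in the issue

def to_comment_safe_string(s, max_len=MAX_COMMENT_LEN):
    # escape "*/" so the text cannot terminate the generated /* ... */ comment
    safe = s.replace("*/", "*\\/")
    # cap the length so a 1000-field expression doesn't emit pages of comments
    return safe if len(safe) <= max_len else safe[:max_len] + "..."

print(len(to_comment_safe_string("x" * 1000)))  # 131: 128 chars plus "..."
print(to_comment_safe_string("short"))  # short strings pass through unchanged
```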
[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217063#comment-15217063 ] Apache Spark commented on SPARK-14253: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12051 > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14253: Assignee: Andrew Or (was: Apache Spark) > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14253: Assignee: Apache Spark (was: Andrew Or) > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217055#comment-15217055 ] Marcelo Vanzin commented on SPARK-14209: Are you able to share the logs from your driver? > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217051#comment-15217051 ] Apache Spark commented on SPARK-12382: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/12050 > Remove spark.mllib GBT implementation and wrap spark.ml > --- > > Key: SPARK-12382 > URL: https://issues.apache.org/jira/browse/SPARK-12382 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > After the GBT implementation is moved to spark.ml, we should remove the > implementation from spark.mllib. The MLlib GBTs will then just call the > implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12382: Assignee: (was: Apache Spark) > Remove spark.mllib GBT implementation and wrap spark.ml > --- > > Key: SPARK-12382 > URL: https://issues.apache.org/jira/browse/SPARK-12382 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > After the GBT implementation is moved to spark.ml, we should remove the > implementation from spark.mllib. The MLlib GBTs will then just call the > implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12382: Assignee: Apache Spark > Remove spark.mllib GBT implementation and wrap spark.ml > --- > > Key: SPARK-12382 > URL: https://issues.apache.org/jira/browse/SPARK-12382 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Apache Spark > > After the GBT implementation is moved to spark.ml, we should remove the > implementation from spark.mllib. The MLlib GBTs will then just call the > implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14257: Assignee: Shixiong Zhu (was: Apache Spark) > Allow multiple continuous queries to be started from the same DataFrame > --- > > Key: SPARK-14257 > URL: https://issues.apache.org/jira/browse/SPARK-14257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now DataFrame stores the Source in StreamingRelation, which prevents it from > being reused across multiple continuous queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217049#comment-15217049 ] Apache Spark commented on SPARK-14257: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12049 > Allow multiple continuous queries to be started from the same DataFrame > --- > > Key: SPARK-14257 > URL: https://issues.apache.org/jira/browse/SPARK-14257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now DataFrame stores the Source in StreamingRelation, which prevents it from > being reused across multiple continuous queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14257: Assignee: Apache Spark (was: Shixiong Zhu) > Allow multiple continuous queries to be started from the same DataFrame > --- > > Key: SPARK-14257 > URL: https://issues.apache.org/jira/browse/SPARK-14257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now DataFrame stores the Source in StreamingRelation, which prevents it from > being reused across multiple continuous queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217047#comment-15217047 ] Liang-Chi Hsieh commented on SPARK-14253: - In fact, the current HiveFunctionRegistry doesn't handle creating temporary functions; it only handles looking them up and wrapping them in Hive UDF classes. Currently, the DDL commands to create and drop temporary functions are passed directly to Hive as native commands. > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
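The change discussed in SPARK-14253 is essentially to keep temporary functions in Spark's own session-local registry rather than round-tripping CREATE/DROP TEMPORARY FUNCTION through Hive as native commands. A minimal sketch of such a registry (illustrative Python, not Spark's Scala FunctionRegistry; names are hypothetical):

```python
# Session-local registry: create, drop, and lookup never touch an external
# metastore, so "temporary" has one clear meaning and no extra Hive call.
class FunctionRegistry:
    def __init__(self):
        self._temp = {}

    def create_temp_function(self, name, fn):
        self._temp[name.lower()] = fn  # function names are case-insensitive

    def drop_temp_function(self, name):
        self._temp.pop(name.lower(), None)

    def lookup(self, name):
        return self._temp.get(name.lower())

reg = FunctionRegistry()
reg.create_temp_function("strlen", len)
print(reg.lookup("STRLEN")("spark"))  # 5: resolved locally, case-insensitively
reg.drop_temp_function("strlen")
print(reg.lookup("strlen"))  # None: dropped without any external call
```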
[jira] [Updated] (SPARK-14256) Remove parameter sqlContext from as.DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-14256: Description: Currently, the user is required to pass the sqlContext parameter to both createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it should be optional in the signature of as.DataFrame. (was: Currently, the user is required to pass the sqlContext parameter to both createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it should be obviated from the signature of these two methods.) > Remove parameter sqlContext from as.DataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Currently, the user is required to pass the sqlContext parameter to both > createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it > should be optional in the signature of as.DataFrame.
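The change requested above — dropping a globally-available context from an API signature — follows a common pattern: the parameter becomes optional and falls back to a process-wide default. A minimal hypothetical sketch in plain Python (not SparkR; every name here is invented for illustration):

```python
# Sketch: make a singleton "context" argument optional by falling back
# to a module-level default. This mirrors the idea behind making
# sqlContext optional in as.DataFrame; it is not SparkR code.

_default_context = None

def set_default_context(ctx):
    """Register the process-wide default context (stand-in for sqlContext)."""
    global _default_context
    _default_context = ctx

def as_data_frame(data, context=None):
    """Build a toy data frame; 'context' may be omitted once a default exists."""
    ctx = context if context is not None else _default_context
    if ctx is None:
        raise ValueError("no context supplied and no default registered")
    return {"context": ctx, "data": list(data)}
```

Callers that already hold an explicit context can still pass it, so the change is backward compatible.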
[jira] [Created] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
Shixiong Zhu created SPARK-14257: Summary: Allow multiple continuous queries to be started from the same DataFrame Key: SPARK-14257 URL: https://issues.apache.org/jira/browse/SPARK-14257 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now DataFrame stores the Source in StreamingRelation, which prevents reusing it across multiple continuous queries.
[jira] [Updated] (SPARK-14256) Remove parameter sqlContext from as.DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-14256: Summary: Remove parameter sqlContext from as.DataFrame (was: Remove parameter sqlContext from as.DataFrame and createDataFrame) > Remove parameter sqlContext from as.DataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Currently, the user is required to pass the sqlContext parameter to both > createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it > should be obviated from the signature of these two methods.
[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217044#comment-15217044 ] Liang-Chi Hsieh commented on SPARK-14253: - I already added this support in https://github.com/apache/spark/pull/12036. > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions itself instead of passing > them to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing.
[jira] [Updated] (SPARK-14256) Remove parameter sqlContext from as.DataFrame and createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-14256: Description: Currently, the user is required to pass the sqlContext parameter to both createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it should be obviated from the signature of these two methods. > Remove parameter sqlContext from as.DataFrame and createDataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Currently, the user is required to pass the sqlContext parameter to both > createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it > should be obviated from the signature of these two methods.
[jira] [Commented] (SPARK-14256) Remove parameter sqlContext from as.DataFrame and createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217018#comment-15217018 ] Oscar D. Lara Yejas commented on SPARK-14256: - I'm working on this one. > Remove parameter sqlContext from as.DataFrame and createDataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas >
[jira] [Created] (SPARK-14256) Remove parameter sqlContext from as.DataFrame and createDataFrame
Oscar D. Lara Yejas created SPARK-14256: --- Summary: Remove parameter sqlContext from as.DataFrame and createDataFrame Key: SPARK-14256 URL: https://issues.apache.org/jira/browse/SPARK-14256 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Oscar D. Lara Yejas
[jira] [Commented] (SPARK-14255) Streaming Aggregation
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217002#comment-15217002 ] Apache Spark commented on SPARK-14255: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/12048 > Streaming Aggregation > - > > Key: SPARK-14255 > URL: https://issues.apache.org/jira/browse/SPARK-14255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >
[jira] [Assigned] (SPARK-14255) Streaming Aggregation
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14255: Assignee: Michael Armbrust (was: Apache Spark) > Streaming Aggregation > - > > Key: SPARK-14255 > URL: https://issues.apache.org/jira/browse/SPARK-14255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >
[jira] [Assigned] (SPARK-14255) Streaming Aggregation
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14255: Assignee: Apache Spark (was: Michael Armbrust) > Streaming Aggregation > - > > Key: SPARK-14255 > URL: https://issues.apache.org/jira/browse/SPARK-14255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >
[jira] [Commented] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216999#comment-15216999 ] Seth Hendrickson commented on SPARK-12326: -- [~josephkb] When moving the helper classes to ML, I agree it will be good to make classes private where possible. However, I am not sure what you mean by "change the APIs." Also, could you give an example of what you had in mind as far as eliminating duplicate data stored in the final model? Thanks! > Move GBT implementation from spark.mllib to spark.ml > > > Key: SPARK-12326 > URL: https://issues.apache.org/jira/browse/SPARK-12326 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Several improvements can be made to gradient boosted trees, but they are not > possible without moving the GBT implementation to spark.ml (e.g. > rawPrediction column, feature importance). This Jira is for moving the > current GBT implementation to spark.ml, which will have roughly the following > steps: > 1. Copy the implementation to spark.ml and change spark.ml classes to use > that implementation. Current tests will ensure that the implementations learn > exactly the same models. > 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, > InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since > eventually all tree implementations will reside in spark.ml, the helper > classes should as well. > 3. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence.
[jira] [Resolved] (SPARK-14215) Support chained Python UDF
[ https://issues.apache.org/jira/browse/SPARK-14215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14215. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12014 [https://github.com/apache/spark/pull/12014] > Support chained Python UDF > -- > > Key: SPARK-14215 > URL: https://issues.apache.org/jira/browse/SPARK-14215 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > pyudf(pyudf(xx))
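The "pyudf(pyudf(xx))" shorthand above means feeding one Python UDF's output directly into another. As an illustration of the chaining idea only — plain Python, not PySpark's actual implementation, with all names invented — two tagged functions can be fused so the chain evaluates in a single pass:

```python
# Toy model of chained Python UDFs. In Spark, evaluating pyudf(pyudf(x))
# naively would cost two executor<->Python round-trips; fusing the chain
# into one callable lets it be evaluated in a single pass.

def udf(f):
    """Stand-in for a UDF registration step: just tags the function."""
    f.is_udf = True
    return f

def fuse(outer, inner):
    """Fuse outer(inner(x)) into one UDF-tagged callable."""
    def fused(x):
        return outer(inner(x))
    return udf(fused)

inc = udf(lambda x: x + 1)
double = udf(lambda x: x * 2)
double_then_inc = fuse(inc, double)  # behaves like inc(double(x))
```

The fused callable is itself a "UDF", so longer chains can be folded the same way, pairwise.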
[jira] [Created] (SPARK-14255) Streaming Aggregation
Michael Armbrust created SPARK-14255: Summary: Streaming Aggregation Key: SPARK-14255 URL: https://issues.apache.org/jira/browse/SPARK-14255 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust
[jira] [Assigned] (SPARK-14224) Cannot project all columns from a table with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14224: Assignee: (was: Apache Spark) > Cannot project all columns from a table with ~1,100 columns > --- > > Key: SPARK-14224 > URL: https://issues.apache.org/jira/browse/SPARK-14224 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > I created a temporary table from the 1000 genomes dataset and cached it. When I > try to select all columns for even a single row, I get the following exception. > Setting and unsetting {{spark.sql.codegen.wholeStage}} and > {{spark.sql.codegen}} have no effect. > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > vcfData.registerTempTable("genomesTable") > %sql select * from genomesTable > Error in SQL statement: RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > /* 001 */ > /* 002 */ public java.lang.Object generate(Object[] references) { > /* 003 */ return new SpecificSafeProjection(references); > {code} > cc [~rxin]
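The "grows beyond 64 KB" failure above stems from the JVM's limit on the bytecode size of a single method: emitting one flat generated projection body for ~1,100 columns exceeds it. A generic remedy is to split the generated code into bounded-size helper functions plus a small driver. The sketch below illustrates that splitting strategy in plain Python; the chunk size and function names are invented and do not reflect Spark's actual code generator:

```python
# Sketch of the "split generated code into bounded-size helpers" remedy
# for per-method code-size limits. Instead of one giant function body
# copying N columns, emit helpers of at most `chunk` assignments each
# and a driver that calls them in order.

def generate_projection(n_cols, chunk=300):
    helpers = []
    for start in range(0, n_cols, chunk):
        stop = min(start + chunk, n_cols)
        body = "".join(f"    out[{i}] = row[{i}]\n" for i in range(start, stop))
        helpers.append(f"def apply_{start}(row, out):\n{body}")
    calls = "".join(f"    apply_{s}(row, out)\n" for s in range(0, n_cols, chunk))
    driver = (f"def project(row):\n"
              f"    out = [None] * {n_cols}\n"
              f"{calls}"
              f"    return out\n")
    return "".join(helpers) + driver

# Compile a projection for 1,100 columns, the width from the bug report.
ns = {}
exec(generate_projection(1100), ns)
```

Each generated helper stays well under any fixed size cap, no matter how many columns the schema has.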
[jira] [Assigned] (SPARK-14223) Cannot project all columns from a parquet file with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14223: Assignee: Apache Spark > Cannot project all columns from a parquet file with ~1,100 columns > --- > > Key: SPARK-14223 > URL: https://issues.apache.org/jira/browse/SPARK-14223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Assignee: Apache Spark >Priority: Critical > Attachments: 1000genomes.gz.parquet > > > The parquet file is generated by saving the first 10 rows of the 1000 genomes > dataset. When I try to run "select * from" a temp table created from the file > I get the following error: > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > data.registerTempTable("genomesTable") > sql("select * from genomesTable limit > 10").write.format("parquet").save("/tmp/test-genome.parquet") > sqlContext.read.format("parquet").load("/tmp/test-genome.parquet").registerTempTable("parquetTable") > %sql select * from parquetTable > SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 4695, > ip-10-0-251-144.us-west-2.compute.internal): java.lang.ClassCastException: > org.apache.spark.sql.execution.vectorized.ColumnarBatch cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:237) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:230) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > cc [~rxin]
[jira] [Commented] (SPARK-14224) Cannot project all columns from a table with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216909#comment-15216909 ] Apache Spark commented on SPARK-14224: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12047 > Cannot project all columns from a table with ~1,100 columns > --- > > Key: SPARK-14224 > URL: https://issues.apache.org/jira/browse/SPARK-14224 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > I created a temporary table from the 1000 genomes dataset and cached it. When I > try to select all columns for even a single row, I get the following exception. > Setting and unsetting {{spark.sql.codegen.wholeStage}} and > {{spark.sql.codegen}} have no effect. > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > vcfData.registerTempTable("genomesTable") > %sql select * from genomesTable > Error in SQL statement: RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > /* 001 */ > /* 002 */ public java.lang.Object generate(Object[] references) { > /* 003 */ return new SpecificSafeProjection(references); > {code} > cc [~rxin]
[jira] [Assigned] (SPARK-14223) Cannot project all columns from a parquet file with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14223: Assignee: (was: Apache Spark) > Cannot project all columns from a parquet file with ~1,100 columns > --- > > Key: SPARK-14223 > URL: https://issues.apache.org/jira/browse/SPARK-14223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Critical > Attachments: 1000genomes.gz.parquet > > > The parquet file is generated by saving the first 10 rows of the 1000 genomes > dataset. When I try to run "select * from" a temp table created from the file > I get the following error: > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > data.registerTempTable("genomesTable") > sql("select * from genomesTable limit > 10").write.format("parquet").save("/tmp/test-genome.parquet") > sqlContext.read.format("parquet").load("/tmp/test-genome.parquet").registerTempTable("parquetTable") > %sql select * from parquetTable > SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 4695, > ip-10-0-251-144.us-west-2.compute.internal): java.lang.ClassCastException: > org.apache.spark.sql.execution.vectorized.ColumnarBatch cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:237) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:230) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > cc [~rxin]
[jira] [Assigned] (SPARK-14224) Cannot project all columns from a table with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14224: Assignee: Apache Spark > Cannot project all columns from a table with ~1,100 columns > --- > > Key: SPARK-14224 > URL: https://issues.apache.org/jira/browse/SPARK-14224 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > > I created a temporary table from the 1000 genomes dataset and cached it. When I > try to select all columns for even a single row, I get the following exception. > Setting and unsetting {{spark.sql.codegen.wholeStage}} and > {{spark.sql.codegen}} have no effect. > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > vcfData.registerTempTable("genomesTable") > %sql select * from genomesTable > Error in SQL statement: RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > /* 001 */ > /* 002 */ public java.lang.Object generate(Object[] references) { > /* 003 */ return new SpecificSafeProjection(references); > {code} > cc [~rxin]
[jira] [Commented] (SPARK-14223) Cannot project all columns from a parquet file with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216910#comment-15216910 ] Apache Spark commented on SPARK-14223: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12047 > Cannot project all columns from a parquet file with ~1,100 columns > --- > > Key: SPARK-14223 > URL: https://issues.apache.org/jira/browse/SPARK-14223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Critical > Attachments: 1000genomes.gz.parquet > > > The parquet file is generated by saving the first 10 rows of the 1000 genomes > dataset. When I try to run "select * from" a temp table created from the file > I get the following error: > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > data.registerTempTable("genomesTable") > sql("select * from genomesTable limit > 10").write.format("parquet").save("/tmp/test-genome.parquet") > sqlContext.read.format("parquet").load("/tmp/test-genome.parquet").registerTempTable("parquetTable") > %sql select * from parquetTable > SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 4695, > ip-10-0-251-144.us-west-2.compute.internal): java.lang.ClassCastException: > org.apache.spark.sql.execution.vectorized.ColumnarBatch cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:237) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:230) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > cc [~rxin]
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216899#comment-15216899 ] Miles Crawford commented on SPARK-14209: That's fantastic news! Let me know if I can help test a fix. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand?
[jira] [Assigned] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14254: Assignee: Apache Spark > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Apache Spark > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Assigned] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14254: Assignee: (was: Apache Spark) > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Commented] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216893#comment-15216893 ] Apache Spark commented on SPARK-14254: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12046 > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Updated] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-14254: - Description: It would be very helpful for network performance investigation if we log the time spent on connecting and resolving host. (was: It would be very helpful for if we log the time spent on connecting and resolving host.) > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Updated] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-14254: - Description: It would be very helpful for if we log the time spent on connecting and resolving host. (was: It would be very helpful if we log the time spent on connecting and resolving host.) > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for if we log the time spent on connecting and > resolving host.
[jira] [Created] (SPARK-14254) Add logs to help investigate the network performance
Shixiong Zhu created SPARK-14254: Summary: Add logs to help investigate the network performance Key: SPARK-14254 URL: https://issues.apache.org/jira/browse/SPARK-14254 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu It would be very helpful if we log the time spent on connecting and resolving host.
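The instrumentation proposed above boils down to timing host resolution and connection setup separately and logging each. A minimal generic sketch in plain Python (illustrative only — the real change lives in Spark's transport code, and the `timed` helper here is hypothetical):

```python
import socket
import time


def timed(label, fn, *args):
    """Run fn(*args), print how long it took, and return the result."""
    start = time.monotonic()
    try:
        return fn(*args)
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        print("%s took %.1f ms" % (label, elapsed_ms))


# Timing resolution separately from connecting makes a slow DNS lookup
# visible in the logs, which is what the issue asks for.
addrs = timed("resolving localhost", socket.getaddrinfo, "localhost", 80)
```

Wrapping each phase independently is what makes it possible to tell a slow lookup from a slow TCP connect when reading the logs afterwards.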
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216888#comment-15216888 ] Marcelo Vanzin commented on SPARK-14209: 2.0 doesn't currently suffer from this problem because, instead, it suffers from SPARK-14252. :-p It shouldn't be hard to provide a fix for 1.6, though. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
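Until the underlying block-recovery behavior is fixed, the window for this failure can sometimes be narrowed by raising retry-related settings. A hedged mitigation sketch — the values are illustrative, and this is not a fix for the bug itself:

```properties
# spark-defaults.conf — illustrative values only
spark.task.maxFailures      8
spark.shuffle.io.maxRetries 10
spark.shuffle.io.retryWait  15s
```

Raising the retry budget only buys time for executors to come back; blocks held solely by a preempted container are still lost.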
[jira] [Commented] (SPARK-14252) Executors do not try to download remote cached blocks
[ https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216886#comment-15216886 ] Marcelo Vanzin commented on SPARK-14252: Here's a slightly updated piece of code where it's easier to show the problem:
{code}
val sc = new SparkContext(conf)
try {
  val mapCount = sc.accumulator(0L)
  val rdd = sc.parallelize(1 to 1000, 10).map { i =>
    mapCount += 1
    i
  }.cache()
  rdd.count()
  println(s"Map count after first count: $mapCount")

  // Create a single task that will sleep and block, so that a particular executor is busy.
  // This should force future tasks to download cached data from that executor.
  println("Running sleep job..")
  val thread = new Thread(new Runnable() {
    override def run(): Unit = {
      rdd.mapPartitionsWithIndex { (i, iter) =>
        if (i == 0) {
          Thread.sleep(TimeUnit.MINUTES.toMillis(10))
        }
        iter
      }.count()
    }
  })
  thread.setDaemon(true)
  thread.start()

  // Wait a few seconds to make sure the task is running (too lazy for listeners)
  println("Waiting for tasks to start...")
  TimeUnit.SECONDS.sleep(10)

  // Now run a job that will touch everything and should use the cached data.
  val cnt = rdd.map(_*2).count()
  println(s"Counted $cnt elements.")

  println("Killing sleep job.")
  thread.interrupt()
  thread.join()

  println(s"Map count after all tasks finished: $mapCount")
} finally {
  sc.stop()
}
{code}
On 1.6 I get:
{noformat}
Map count after first count: 1000
Map count after all tasks finished: 1000
{noformat}
On 2.0 I get:
{noformat}
Map count after first count: 1000
Map count after all tasks finished: 1500
{noformat}
> Executors do not try to download remote cached blocks > - > > Key: SPARK-14252 > URL: https://issues.apache.org/jira/browse/SPARK-14252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 > includes SPARK-12817, which changed the caching code a bit to remove > duplication.
But it seems to have removed the part where executors check > whether other executors contain the needed cached block. > In 1.6, that was done by the call to {{BlockManager.get}} in > {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls > {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and > thus the executor never gets block that are cached by other executors, > causing the blocks to be instead recomputed locally. > I wrote a small program that shows this. In 1.6, running with > {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages > like these in the logs: > {noformat} > 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7 > 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7 > 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered > locally > 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 > 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 > from BlockManagerId(1, blah, 58831) > 1 > {noformat} > On 2.0, I get (almost) all the 10 partitions cached on *both* executors, > because once the second one fails to find a block locally it just recomputes > it and caches it. It never tries to download the block from the other > executor. The log messages above, which still exist in the code, don't show > up anywhere. > Here's the code I used for the above (trimmed of some other stuff from my > little test harness, so might not compile as is): > {code} > val sc = new SparkContext(conf) > try { > val rdd = sc.parallelize(1 to 1000, 10) > rdd.cache() > rdd.count() > // Create a single task that will sleep and block, so that a particular > executor is busy. > // This should force future tasks to download cached data from that > executor. 
> println("Running sleep job..") > val thread = new Thread(new Runnable() { > override def run(): Unit = { > rdd.mapPartitionsWithIndex { (i, iter) => > if (i == 0) { > Thread.sleep(TimeUnit.MINUTES.toMillis(10)) > } > iter > }.count() > } > }) > thread.setDaemon(true) > thread.start() > // Wait a few seconds to make sure the task is running (too lazy for > listeners) > println("Waiting for tasks to start...") > TimeUnit.SECONDS.sleep(10) > // Now run a job that will touch
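The lookup-order difference described in this thread (1.6: local cache, then other executors, then recompute; 2.0: the remote step is skipped) can be sketched with a toy block manager. Plain Python and purely schematic — none of these names are Spark's actual internals:

```python
def get_or_compute_16(local, remote, key, compute):
    """1.6-style lookup (CacheManager.getOrCompute): local, then remote, then recompute."""
    if key in local:
        return local[key]
    if key in remote:              # the remote check that 2.0 lost
        local[key] = remote[key]   # fetch the block from the other executor
        return local[key]
    local[key] = compute()
    return local[key]


def get_or_else_update_20(local, remote, key, compute):
    """2.0-style lookup (BlockManager.getOrElseUpdate) as described in the report."""
    if key in local:
        return local[key]
    local[key] = compute()         # recomputes even though another executor has the block
    return local[key]


recomputations = []
other_executor = {"rdd_0_7": 42}   # block cached on another executor

get_or_compute_16({}, other_executor, "rdd_0_7", lambda: recomputations.append(1) or 42)
get_or_else_update_20({}, other_executor, "rdd_0_7", lambda: recomputations.append(1) or 42)
print("recomputations:", len(recomputations))   # prints: recomputations: 1
```

Only the 2.0-style path recomputes, which matches the 1500-vs-1000 accumulator counts in the comment above.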
[jira] [Created] (SPARK-14253) Avoid registering temporary functions in Hive
Andrew Or created SPARK-14253: - Summary: Avoid registering temporary functions in Hive Key: SPARK-14253 URL: https://issues.apache.org/jira/browse/SPARK-14253 Project: Spark Issue Type: Bug Components: SQL Reporter: Andrew Or Assignee: Andrew Or Spark should just handle all temporary functions itself instead of passing them to Hive, which is what the HiveFunctionRegistry does at the moment. The extra call to Hive is unnecessary and potentially slow, and it makes the semantics of a "temporary function" confusing.
[jira] [Created] (SPARK-14252) Executors do not try to download remote cached blocks
Marcelo Vanzin created SPARK-14252: -- Summary: Executors do not try to download remote cached blocks Key: SPARK-14252 URL: https://issues.apache.org/jira/browse/SPARK-14252 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Marcelo Vanzin I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 includes SPARK-12817, which changed the caching code a bit to remove duplication. But it seems to have removed the part where executors check whether other executors contain the needed cached block. In 1.6, that was done by the call to {{BlockManager.get}} in {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and thus the executor never gets blocks that are cached by other executors, causing the blocks to be recomputed locally instead. I wrote a small program that shows this. In 1.6, running with {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages like these in the logs:
{noformat}
16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered locally
16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 from BlockManagerId(1, blah, 58831)
{noformat}
On 2.0, I get (almost) all the 10 partitions cached on *both* executors, because once the second one fails to find a block locally it just recomputes it and caches it. It never tries to download the block from the other executor. The log messages above, which still exist in the code, don't show up anywhere.
Here's the code I used for the above (trimmed of some other stuff from my little test harness, so might not compile as is):
{code}
val sc = new SparkContext(conf)
try {
  val rdd = sc.parallelize(1 to 1000, 10)
  rdd.cache()
  rdd.count()

  // Create a single task that will sleep and block, so that a particular executor is busy.
  // This should force future tasks to download cached data from that executor.
  println("Running sleep job..")
  val thread = new Thread(new Runnable() {
    override def run(): Unit = {
      rdd.mapPartitionsWithIndex { (i, iter) =>
        if (i == 0) {
          Thread.sleep(TimeUnit.MINUTES.toMillis(10))
        }
        iter
      }.count()
    }
  })
  thread.setDaemon(true)
  thread.start()

  // Wait a few seconds to make sure the task is running (too lazy for listeners)
  println("Waiting for tasks to start...")
  TimeUnit.SECONDS.sleep(10)

  // Now run a job that will touch everything and should use the cached data.
  val cnt = rdd.map(_*2).count()
  println(s"Counted $cnt elements.")

  println("Killing sleep job.")
  thread.interrupt()
  thread.join()
} finally {
  sc.stop()
}
{code}
[jira] [Resolved] (SPARK-14227) Add method for printing out generated code for debugging
[ https://issues.apache.org/jira/browse/SPARK-14227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14227. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.0.0 > Add method for printing out generated code for debugging > > > Key: SPARK-14227 > URL: https://issues.apache.org/jira/browse/SPARK-14227 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.0.0 > > > Currently it's hard to get at the output of WholeStageCodegen. It would be > nice to add debug support for the generated code. > cc [~rxin]
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216769#comment-15216769 ] Michael Zieliński commented on SPARK-7146: -- We ended up copying a significant portion of Params (HasInputCol(s), HasOutputCol(s), HasProbabilityCol) for our custom Transformers/Estimators. This creates a problem if you want to abstract over Transformers, e.g. have a piece of code that accepts any transformer with a "setInputCol" method. Because our HasInputCol and the native Spark HasInputCol are different traits, the result is very clunky code (structural typing plus a few ugly type aliases). Opening up at least some traits would result in cleaner code and more reuse. This document (https://docs.google.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#) separates the traits into a few groups, like I/O, hyper-parameters, and optimization. I/O would be a very good place to start, since those are re-used most frequently across algorithms. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Discussion: Should the Param traits in sharedParams.scala be public? > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private.
> Proposal: Either > (a) make the shared params private to encourage users to write specialized > documentation and value checks for parameters, or > (b) design a better way to encourage overriding documentation and parameter > value checks
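The reuse problem described in the comment — third-party code duplicating HasInputCol and then failing to interoperate with Spark's copy — can be shown with a toy shared-param mixin. A hedged Python sketch (spark.ml's real traits are Scala; every name here is illustrative):

```python
class HasInputCol:
    """Toy analogue of spark.ml's (private) HasInputCol shared-param trait."""

    def __init__(self):
        self.input_col = None

    def set_input_col(self, name):
        self.input_col = name
        return self


class Tokenizer(HasInputCol):
    pass


class Binarizer(HasInputCol):
    pass


def configure(transformer, col):
    """Generic code that works for any transformer sharing the mixin.

    If the shared trait stays private, third parties must copy it, and a
    helper like this can no longer cover both copies without structural
    typing tricks -- the clunkiness described in the comment above."""
    return transformer.set_input_col(col)


print(configure(Tokenizer(), "text").input_col)   # prints: text
```

One shared base type is what lets `configure` accept both `Tokenizer` and `Binarizer`; duplicated traits break exactly this.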
[jira] [Updated] (SPARK-14229) PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat
[ https://issues.apache.org/jira/browse/SPARK-14229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Jurney updated SPARK-14229: --- Shepherd: Matei Zaharia > PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat > -- > > Key: SPARK-14229 > URL: https://issues.apache.org/jira/browse/SPARK-14229 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark, Spark Shell >Affects Versions: 1.6.1 >Reporter: Russell Jurney > > I am able to save data to MongoDB from any RDD... provided that RDD does not > belong to a DataFrame. If I use DataFrame.rdd, it is not possible to save via > saveAsNewAPIHadoopFile whatsoever. I have tested that this applies to saving > to MongoDB, BSON Files, and ElasticSearch. > I get the following error when I try to save to a HadoopFile: > config = {"mongo.output.uri": > "mongodb://localhost:27017/agile_data_science.on_time_performance"} > n [3]: on_time_dataframe.rdd.saveAsNewAPIHadoopFile( >...: path='file://unused', >...: outputFormatClass='com.mongodb.hadoop.MongoOutputFormat', >...: keyClass='org.apache.hadoop.io.Text', >...: valueClass='org.apache.hadoop.io.MapWritable', >...: conf=config >...: ) > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1 stored as > values in memory (estimated size 62.7 KB, free 147.3 KB) > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1_piece0 stored > as bytes in memory (estimated size 20.4 KB, free 167.7 KB) > 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in > memory on localhost:61301 (size: 20.4 KB, free: 511.1 MB) > 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 1 from > javaToPython at NativeMethodAccessorImpl.java:-2 > 16/03/28 19:59:57 INFO Configuration.deprecation: mapred.min.split.size is > deprecated. 
Instead, use mapreduce.input.fileinputformat.split.minsize > 16/03/28 19:59:57 INFO parquet.ParquetRelation: Reading Parquet file(s) from > file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet > 16/03/28 19:59:57 INFO spark.SparkContext: Starting job: take at > SerDeUtil.scala:231 > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Got job 1 (take at > SerDeUtil.scala:231) with 1 output partitions > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 > (take at SerDeUtil.scala:231) > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Missing parents: List() > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting ResultStage 1 > (MapPartitionsRDD[6] at mapPartitions at SerDeUtil.scala:146), which has no > missing parents > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 14.9 KB, free 182.6 KB) > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 7.5 KB, free 190.1 KB) > 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on localhost:61301 (size: 7.5 KB, free: 511.1 MB) > 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 2 from broadcast > at DAGScheduler.scala:1006 > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 1 (MapPartitionsRDD[6] at mapPartitions at > SerDeUtil.scala:146) > 16/03/28 19:59:57 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with > 1 tasks > 16/03/28 19:59:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 1.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2739 bytes) > 16/03/28 19:59:57 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID > 8) > 16/03/28 19:59:58 INFO > parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1: 
Input split: > ParquetInputSplit{part: > file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet > start: 0 end: 134217728 length: 134217728 hosts: []} > 16/03/28 19:59:59 INFO compress.CodecPool: Got brand-new decompressor [.gz] > 16/03/28 19:59:59 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 8) > net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for pyspark.sql.types._create_row) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at
[jira] [Commented] (SPARK-14229) PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat
[ https://issues.apache.org/jira/browse/SPARK-14229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216759#comment-15216759 ] Russell Jurney commented on SPARK-14229: To reproduce for MongoDB: The data I am having trouble storing is at s3://agile_data_science/on_time_performance.parquet and its ACL is set to public, so you should be able to load it. If you can't reproduce, let me know and I will try to set up S3 access from my machine (I'm working in local mode). My init code:
{code:title=init.py}
import json
import pymongo
import pymongo_spark

pymongo_spark.activate()
{code}
To reproduce a successful loading of this data, as a textFile to an RDD without ever being a DataFrame (note that this generates 20GB of temporary files, be sure you have the space!):
{code:title=success.py}
on_time_lines = sc.textFile("s3://agile_data_science/On_Time_On_Time_Performance_2015.jsonl.gz")
on_time_performance = on_time_lines.map(lambda x: json.loads(x))
on_time_performance.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
{code}
To reproduce a failed loading of this data when saving the RDD of a DataFrame or its descendants:
{code:title=failure.py}
on_time_dataframe = sqlContext.read.parquet('s3://agile_data_science/on_time_performance.parquet')
on_time_dataframe = on_time_dataframe.drop("")  # remove empty field
on_time_dataframe.rdd.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
{code}
And with the longer syntax:
{code:title=long_failure.py}
config = {"mongo.output.uri": "mongodb://localhost:27017/agile_data_science.on_time_performance"}
on_time_dataframe.rdd.saveAsNewAPIHadoopFile(
    path='file://unused',
    outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf=config
)
{code}
The documents look like:
{code}
Row(Year=2015, Quarter=1, Month=1, DayofMonth=1, DayOfWeek=4, FlightDate=u'2015-01-01',
UniqueCarrier=u'AA', AirlineID=19805, Carrier=u'AA', TailNum=u'N787AA', FlightNum=1, OriginAirportID=12478, OriginAirportSeqID=1247802, OriginCityMarketID=31703, Origin=u'JFK', OriginCityName=u'New York, NY', OriginState=u'NY', OriginStateFips=36, OriginStateName=u'New York', OriginWac=22, DestAirportID=12892, DestAirportSeqID=1289203, DestCityMarketID=32575, Dest=u'LAX', DestCityName=u'Los Angeles, CA', DestState=u'CA', DestStateFips=6, DestStateName=u'California', DestWac=91, CRSDepTime=900, DepTime=855, DepDelay=-5.0, DepDelayMinutes=0.0, DepDel15=0.0, DepartureDelayGroups=-1, DepTimeBlk=u'0900-0959', TaxiOut=17.0, WheelsOff=912, WheelsOn=1230, TaxiIn=7.0, CRSArrTime=1230, ArrTime=1237, ArrDelay=7.0, ArrDelayMinutes=7.0, ArrDel15=0.0, ArrivalDelayGroups=0, ArrTimeBlk=u'1200-1259', Cancelled=0.0, CancellationCode=u'', Diverted=0.0, CRSElapsedTime=390.0, ActualElapsedTime=402.0, AirTime=378.0, Flights=1.0, Distance=2475.0, DistanceGroup=10, CarrierDelay=None, WeatherDelay=None, NASDelay=None, SecurityDelay=None, LateAircraftDelay=None, FirstDepTime=None, TotalAddGTime=None, LongestAddGTime=None, DivAirportLandings=0, DivReachedDest=None, DivActualElapsedTime=None, DivArrDelay=None, DivDistance=None, Div1Airport=u'', Div1AirportID=None, Div1AirportSeqID=None, Div1WheelsOn=None, Div1TotalGTime=None, Div1LongestGTime=None, Div1WheelsOff=None, Div1TailNum=u'', Div2Airport=u'', Div2AirportID=None, Div2AirportSeqID=None, Div2WheelsOn=None, Div2TotalGTime=None, Div2LongestGTime=None, Div2WheelsOff=None, Div2TailNum=u'', Div3Airport=u'', Div3AirportID=None, Div3AirportSeqID=None, Div3WheelsOn=None, Div3TotalGTime=None, Div3LongestGTime=None, Div3WheelsOff=u'', Div3TailNum=u'', Div4Airport=u'', Div4AirportID=u'', Div4AirportSeqID=u'', Div4WheelsOn=u'', Div4TotalGTime=u'', Div4LongestGTime=u'', Div4WheelsOff=u'', Div4TailNum=u'', Div5Airport=u'', Div5AirportID=u'', Div5AirportSeqID=u'', Div5WheelsOn=u'', Div5TotalGTime=u'', Div5LongestGTime=u'', Div5WheelsOff=u'', 
Div5TailNum=u'') {code} As JSON: {code:title=data.json} {"Year":2015,"Quarter":1,"Month":1,"DayofMonth":10,"DayOfWeek":6,"FlightDate":"2015-01-10","UniqueCarrier":"AA","AirlineID":19805,"Carrier":"AA","TailNum":"N790AA","FlightNum":1,"OriginAirportID":12478,"OriginAirportSeqID":1247802,"OriginCityMarketID":31703,"Origin":"JFK","OriginCityName":"New York, NY","OriginState":"NY","OriginStateFips":36,"OriginStateName":"New York","OriginWac":22,"DestAirportID":12892,"DestAirportSeqID":1289203,"DestCityMarketID":32575,"Dest":"LAX","DestCityName":"Los Angeles,
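The {{net.razorvine.pickle.PickleException}} in the report is triggered by {{Row}} objects reaching the Java-side unpickler. A commonly suggested workaround is to map each Row to a plain dict (PySpark's {{Row.asDict()}}) before saving. The shape of that transformation, simulated here with a stand-in class since PySpark is not imported (hedged — not verified against this exact dataset):

```python
class FakeRow:
    """Stand-in for pyspark.sql.Row, which also exposes asDict()."""

    def __init__(self, **fields):
        self._fields = fields

    def asDict(self):
        return dict(self._fields)


def rows_to_dicts(rdd_like):
    # In real PySpark this would be something like:
    #   on_time_dataframe.rdd.map(lambda row: row.asDict()).saveToMongoDB(...)
    # so the pickler sees plain dicts instead of Row objects.
    return [row.asDict() for row in rdd_like]


rows = [FakeRow(Origin="JFK", Dest="LAX")]
print(rows_to_dicts(rows))   # prints: [{'Origin': 'JFK', 'Dest': 'LAX'}]
```

The point of the `map` is that nothing of type `pyspark.sql.types._create_row` survives into the saved RDD.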
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216756#comment-15216756 ] Michael Zieliński commented on SPARK-14033: --- Re: ML vs MLlib, I also think about it in terms of RDDs versus DataFrames. Re: Estimator/Model, I prefer the current version that preserves immutability to a larger degree. That said, maybe merging those concepts would make it easier for the next stage of a Pipeline to use outputs from the previous stage. Currently if you have:
val a1 = new Estimator1
val a2 = new Estimator2().setParamAbc(a1.getParamCde)
you can only get the members from Estimator1, but not Estimator1Model. If they're the same class it would make things easier. As an example, you might want to take the top K variables from a Random Forest model as input to Logistic Regression. > Merging Estimator & Model > - > > Key: SPARK-14033 > URL: https://issues.apache.org/jira/browse/SPARK-14033 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Attachments: StyleMutabilityMergingEstimatorandModel.pdf > > > This JIRA is for merging the spark.ml concepts of Estimator and Model. > Goal: Have clearer semantics which match existing libraries (such as > scikit-learn). > For details, please see the linked design doc. Comment on this JIRA to give > feedback on the proposed design. Once the proposal is discussed and this > work is confirmed as ready to proceed, this JIRA will serve as an umbrella > for the merge tasks.
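The friction described in the comment — a later pipeline stage wanting values computed by an earlier stage's fitted model — can be sketched as a toy Estimator/Model pair. Purely illustrative Python; spark.ml's actual classes differ:

```python
class Model:
    """Toy fitted model holding values the next pipeline stage may want."""

    def __init__(self, k, top_features):
        self.k = k
        self.top_features = top_features


class Estimator:
    """Toy version of the current split design: fit() returns a separate Model."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, data):
        return Model(self.k, top_features=sorted(data)[: self.k])


# Under the split design, downstream code can read params off the Estimator
# but must thread the fitted Model around explicitly to reach top_features;
# a merged Estimator/Model would expose both on one object.
est = Estimator(k=2)
model = est.fit([5, 3, 1, 4])
print(model.top_features)   # prints: [1, 3]
```

This mirrors the Random Forest example: the "top K variables" live on the fitted model, not on the estimator the pipeline was configured with.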
[jira] [Commented] (SPARK-14248) Get the path hierarchy from root to leaf in the BisectingKMeansModel
[ https://issues.apache.org/jira/browse/SPARK-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216747#comment-15216747 ] Lakesh commented on SPARK-14248: [~josephkb] It would be a very small addition. You can assign it to me if appropriate. Thanks > Get the path hierarchy from root to leaf in the BisectingKMeansModel > > > Key: SPARK-14248 > URL: https://issues.apache.org/jira/browse/SPARK-14248 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: Lakesh >Priority: Trivial > Labels: newbie > > I was using the BisectingKMeansModel of MLlib in Spark. However, I couldn't > find an existing method in the BisectingKMeansModel to return me the path > hierarchy. So I added a small function that would return me the prediction > path itself. I was thinking this could be useful for people with a similar > use case.
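The "prediction path" the reporter describes — the chain of cluster indices visited from the root down to the leaf a point falls in — can be sketched for a binary clustering tree. A schematic Python sketch (1-D points and a hypothetical node layout; not the actual BisectingKMeansModel internals):

```python
class Node:
    """A node in a toy binary clustering tree (1-D centers for brevity)."""

    def __init__(self, index, center, left=None, right=None):
        self.index, self.center = index, center
        self.left, self.right = left, right


def prediction_path(root, point):
    """Return the node indices visited from the root to the leaf the point
    lands in, descending into whichever child's center is closer."""
    path, node = [], root
    while node is not None:
        path.append(node.index)
        if node.left is None:      # leaf reached
            break
        closer_left = abs(point - node.left.center) <= abs(point - node.right.center)
        node = node.left if closer_left else node.right
    return path


tree = Node(0, 5.0,
            left=Node(1, 2.0, left=Node(3, 1.0), right=Node(4, 3.0)),
            right=Node(2, 8.0))
print(prediction_path(tree, 0.5))   # prints: [0, 1, 3]
```

The returned list is exactly the root-to-leaf hierarchy the issue asks for, rather than just the final leaf index that `predict` gives.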
[jira] [Commented] (SPARK-14246) vars not updated after Scala script reload
[ https://issues.apache.org/jira/browse/SPARK-14246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216734#comment-15216734 ] Jim Powers commented on SPARK-14246: It appears that this is a Scala problem and not a spark one. {noformat} scala -Yrepl-class-based Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74). Type in expressions for evaluation. Or try :help. scala> :load Fail.scala Loading Fail.scala... X: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = $anon$1@5ddf0d24 scala> val a = X.getArray(100) warning: there was one feature warning; re-run with -feature for details a: Array[Double] = Array(0.6445967025789217, 0.8356433165638456, 0.9050186287112574, 0.06344554850936357, 0.008363070900988756, 0.6593626537474886, 0.7424265307039932, 0.5035629973234215, 0.5510831160354674, 0.37366654438205593, 0.33299751842582703, 0.3800633883472283, 0.4153963387304084, 0.2752468331316783, 0.8699452196820426, 0.31938530945559984, 0.7990568815957724, 0.6875841747139724, 0.31949965197609675, 0.026911873556428878, 0.2616536698127736, 0.5580118021155783, 0.28345994848845435, 0.1773433165304532, 0.2549417030032525, 0.9777692443616465, 0.6296846343712603, 0.8589339648876033, 0.7020098253707141, 0.8518829567943531, 0.41154622619731374, 0.1075129613308311, 0.8499252434316056, 0.3841876768177086, 0.137415801614582, 0.27030938222499756, 0.5511585560527115, 0.26252884257087217, ... scala> X = null X: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = null scala> :load Fail.scala Loading Fail.scala... 
X: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = $anon$1@52c8295b scala> X res0: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = null {noformat} So, it appears any anonymous class with 2 or more members may exhibit this problem in the presence of {{-Yrepl-class-based}}. Closing this issue. > vars not updated after Scala script reload > -- > > Key: SPARK-14246 > URL: https://issues.apache.org/jira/browse/SPARK-14246 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Jim Powers > Attachments: Fail.scala, Null.scala, reproduce_transient_npe.scala > > > Attached are two scripts. The problem only exhibits itself with Spark 1.6.0, > 1.6.1, and 2.0.0 using Scala 2.11. Scala 2.10 does not exhibit this problem. > With the Regular Scala 2.11(.7) REPL: > {noformat} > scala> :load reproduce_transient_npe.scala > Loading reproduce_transient_npe.scala... > X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@4149c063 > scala> X > res0: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@4149c063 > scala> val a = X.getArray(10) > warning: there was one feature warning; re-run with -feature for details > a: Array[Double] = Array(0.1701063617079236, 0.17570862034857437, > 0.6065851472098629, 0.4683069994589304, 0.35194859652378363, > 0.04033043823203897, 0.11917887149548367, 0.540367871104426, > 0.18544859881040276, 0.7236380062803334) > scala> X = null > X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = null > scala> :load reproduce_transient_npe.scala > Loading reproduce_transient_npe.scala... 
> X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@5860f3d7 > scala> X > res1: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@5860f3d7 > {noformat} > However, from within the Spark shell (Spark 1.6.0, Scala 2.11.7): > {noformat} > scala> :load reproduce_transient_npe.scala > Loading reproduce_transient_npe.scala... > X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@750e2d33 > scala> val a = X.getArray(100) > warning: there was one feature warning; re-run with -feature for details > a: Array[Double] = Array(0.6330055191546612, 0.017865502179445936, > 0.6334775064489349, 0.9053929733525056, 0.7648311134918273, > 0.5423177955113584, 0.5164344368587143, 0.420054677669768, > 0.7842112717076851, 0.2098345684721057, 0.7925640951404774, > 0.5604706596425998, 0.8104403239147542, 0.7567005967624031, > 0.5221119883682028,