[jira] [Assigned] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15931:


Assignee: Apache Spark

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}






[jira] [Assigned] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15931:


Assignee: (was: Apache Spark)

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}






[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328984#comment-15328984
 ] 

Apache Spark commented on SPARK-15931:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/13636

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}






[jira] [Created] (SPARK-15937) Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish

2016-06-13 Thread Subroto Sanyal (JIRA)
Subroto Sanyal created SPARK-15937:
--

 Summary: Spark declares a succeeding job to be failed in 
yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish
 Key: SPARK-15937
 URL: https://issues.apache.org/jira/browse/SPARK-15937
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Subroto Sanyal


h5. Problem:
A Spark job fails in yarn-cluster mode if it takes less than about 10 seconds to finish. The job execution itself is successful, but the Spark framework declares it failed.
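
For illustration, a minimal short-running job of the kind that can trigger this. This is a hypothetical sketch, not the actual import job shown in the log below:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical minimal reproducer: a job that finishes well under 10 seconds.
// Submitted with --master yarn --deploy-mode cluster, the application itself
// succeeds, but the final status reported for it is FAILED.
object QuickJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("QuickJob"))
    val n = sc.parallelize(1 to 100).map(_ * 2).count()
    println(s"count = $n")
    sc.stop()
  }
}
{code}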
{noformat}
16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1465791692084_0078_01
16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(subroto); users 
with modify permissions: Set(subroto)
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry 
on cluster...
16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of 
US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.big-decimal.precision=32
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.default-timezone=US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from 
US/Eastern to US/Eastern
16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit 
Server VM, 1.7 (Oracle Corporation)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m 
-Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_01/tmp
 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_01
 -XX:MaxPermSize=256m 
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  
'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
 (default classpath)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before 
Plugin Registry start-up: 847.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: -
16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task 
configuration...
16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop 
task to DEBUG: 
16/06/13 10:50:35 INFO cluster.JobProcessor: Processing 
JobInput{_jobName=Import job (76): 
BookorderHS2ImportJob_SparkCluster#import(Identity)}
16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to 
hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
16/06/13 10:50:35 DEBUG hdfs.DFSClient: 
/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: 
masked=rw-r--r--
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #2
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #2
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: 
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
chunkSize=516, chunksPerPacket=127, packetSize=65532
16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for 
[DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new 
packet seqno=0, 
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
16/06/13 10:50:35 DEBUG hdfs.DF

[jira] [Commented] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR

2016-06-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328949#comment-15328949
 ] 

Dongjoon Hyun commented on SPARK-15908:
---

Hi, [~sunrui].
I did SPARK-15807. If you haven't started yet, may I take this one too?

> Add varargs-type dropDuplicates() function in SparkR
> 
>
> Key: SPARK-15908
> URL: https://issues.apache.org/jira/browse/SPARK-15908
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is for API parity of Scala API. Refer to 
> https://issues.apache.org/jira/browse/SPARK-15807
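
For reference, a rough sketch of the Scala varargs API that the SparkR function would mirror. This assumes a Spark 2.0 SparkSession named `spark`; the data and column names are illustrative:
{code}
import spark.implicits._

// Scala already exposes a varargs overload: dropDuplicates(col1: String, cols: String*)
val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "name")
val deduped = df.dropDuplicates("id", "name")   // deduplicate on a subset of columns
deduped.show()
{code}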






[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328941#comment-15328941
 ] 

Manoj Kumar commented on SPARK-14351:
-

I can try working on this.

> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.
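
As a hedged illustration of the kind of copy being profiled (hypothetical names, not Spark's internal code): slicing a flat statistics array allocates a new array on every visit, whereas reading through an offset does not:
{code}
// Hypothetical illustration of the copy-vs-offset trade-off (not Spark internals):
val allStats = new Array[Double](1 << 20)

// Copying a window allocates on every call:
def statsCopy(offset: Int, len: Int): Array[Double] =
  java.util.Arrays.copyOfRange(allStats, offset, offset + len)

// Reading through an offset avoids the allocation:
def statAt(offset: Int, i: Int): Double = allStats(offset + i)
{code}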






[jira] [Resolved] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15932.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939
 ] 

Manoj Kumar edited comment on SPARK-3155 at 6/14/16 5:01 AM:
-

1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth; pruning helps improve interpretability while also improving generalization performance.
3. It is intuitive to prune the tree during training (i.e., stop training after the validation error increases). However, this is very similar to just having a stopping criterion such as maximum depth or minimum samples per node (except that the stopping criterion depends on validation data), and it is quite uncommon to do it. The standard practice (at least according to my lectures) is to train the tree to full depth and then remove leaves according to validation data.

However, if you feel that #14351 is more important, I can focus on that.


was (Author: mechcoder):
1. I agree that the use cases are limited to single trees. You kind of lose 
interpretability if you train the tree to maximum depth. It helps in improving 
interpretability while also improving on generalization performance. 
3. It is intuitive to prune the tree during training (i.e stop training after 
the validation error increases) . However this is very similar to just having a 
stopping criterion such as maximum depth, minimum samples in each node (except 
that the stopping criteria is dependent on validation data)
And is quite uncommon to do it. The standard practise (at least according to my 
lectures) is to train the train to full depth and remove the leaves according 
to validation data.

However, if you feel that #14351 is more important, I can focus on that.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on 
> the validation set made by the leafs’ predictions with the error made by the 
> parent’s predictions.  Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.
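
A hedged sketch of the naive post-pruning pass described in steps (1)-(4) above, written against a toy tree type rather than Spark's internal Node class:
{code}
// Toy binary tree; `nodeError` is the validation error charged to this node's
// own prediction (step 3 above). Not Spark's internal representation.
case class Node(nodeError: Double, left: Option[Node] = None, right: Option[Node] = None)

// Returns the pruned subtree together with its total validation error.
// Step 4, applied bottom-up: if turning this node into a leaf is no worse than
// keeping its (already pruned) children, drop the children.
def prune(node: Node): (Node, Double) = (node.left, node.right) match {
  case (Some(l), Some(r)) =>
    val (pl, errL) = prune(l)
    val (pr, errR) = prune(r)
    if (node.nodeError <= errL + errR) (node.copy(left = None, right = None), node.nodeError)
    else (node.copy(left = Some(pl), right = Some(pr)), errL + errR)
  case _ => (node, node.nodeError)   // already a leaf
}
{code}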






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939
 ] 

Manoj Kumar commented on SPARK-3155:


1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth; pruning helps improve interpretability while also improving generalization performance.
3. It is intuitive to prune the tree during training (i.e., stop training after the validation error increases). However, this is very similar to just having a stopping criterion such as maximum depth or minimum samples per node (except that the stopping criterion depends on validation data), and it is quite uncommon to do it. The standard practice (at least according to my lectures) is to train the tree to full depth and then remove leaves according to validation data.

However, if you feel that #14351 is more important, I can focus on that.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on 
> the validation set made by the leafs’ predictions with the error made by the 
> parent’s predictions.  Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-13 Thread Sean McKibben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328894#comment-15328894
 ] 

Sean McKibben commented on SPARK-12177:
---

Unfortunately I can't contribute what I would like to, but I wholeheartedly 
agree with Cody that waiting for 0.11 doesn't make sense. Kafka is a lynchpin 
of many production scenarios and skipping 0.9 was bad enough for Spark. Waiting 
longer will make Spark overall less competitive in the streaming/fast data 
landscape, and the community will have much harder choices between Kafka 
Streams, Akka Streams, and Spark if another version of Kafka is omitted.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not compatible with the old one, so I added support for the new consumer API. I made separate classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I did not remove the old classes, for backward compatibility: users will not need to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.
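
For context, a hedged sketch of the new (0.9+) Kafka consumer API that the integration has to target. This is plain Kafka client code, not the proposed Spark wrapper; topic, group id, and broker list are placeholders:
{code}
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

// New consumer API: subscribe to topics and poll for batches of records.
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))
for (record <- consumer.poll(1000).asScala) {
  println(s"${record.offset}: ${record.value}")
}
consumer.close()
{code}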






[jira] [Updated] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14351:
--
Priority: Major  (was: Minor)

> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Commented] (SPARK-10835) Change Output of NGram to Array(String, True)

2016-06-13 Thread Hansa Nanayakkara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328883#comment-15328883
 ] 

Hansa Nanayakkara commented on SPARK-10835:
---

Although the problem is solved for Tokenizer, it persists in the NGram class.

> Change Output of NGram to Array(String, True)
> -
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not compatible with LDA, since its input type is Array(String, true).
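
A hedged sketch of the schema mismatch being described; ArrayType's second argument is containsNull, and the names below are illustrative rather than the actual ML schema-check code:
{code}
import org.apache.spark.sql.types._

// NGram currently declares its output column as not containing nulls...
val ngramOutput   = ArrayType(StringType, containsNull = false)
// ...while the downstream stage expects a column declared with containsNull = true.
val expectedInput = ArrayType(StringType, containsNull = true)

// A strict equality check treats these as different types:
println(ngramOutput == expectedInput)   // false
{code}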






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328880#comment-15328880
 ] 

Joseph K. Bradley commented on SPARK-3155:
--

A few thoughts:

(1) I'm less sure about the priority of this task now.  I've had a hard time 
identifying use cases.  Few people train single trees.  For forests, people 
generally want to overfit each tree a bit, not prune.  For boosting, people 
generally use shallow trees so that there is no need for pruning.  It would be 
useful to identify real use cases before we implement this feature.

(2) I agree the 2 args are validation data + error tolerance.

(3) Will most users want to prune during or after training?
* During training: More efficient
* After training: Allows multiple prunings using different error tolerances

I'd say that [SPARK-14351] is the highest priority single-tree improvement I 
know of right now.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on 
> the validation set made by the leafs’ predictions with the error made by the 
> parent’s predictions.  Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-06-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15364:
--
Assignee: Liang-Chi Hsieh

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Now picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-06-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15364:
--
Target Version/s: 2.0.0  (was: 2.1.0)

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
>
> Now picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Resolved] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-06-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15364.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13219
[https://github.com/apache/spark/pull/13219]

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
>
> Now picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Commented] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file

2016-06-13 Thread marymwu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328860#comment-15328860
 ] 

marymwu commented on SPARK-15757:
-

Any update?

> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this 
> orc file
> ---
>
> Key: SPARK-15757
> URL: https://issues.apache.org/jira/browse/SPARK-15757
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
> Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in 
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com): 
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)




[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-06-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328851#comment-15328851
 ] 

Hyukjin Kwon commented on SPARK-15918:
--

Actually, I ran into this case before and thought it might be an issue. However, I realised it actually is not, after executing the equivalent queries in several other DBMSs and checking documentation such as 
https://msdn.microsoft.com/en-us/library/ms180026.aspx and 
http://www.w3schools.com/sql/sql_union.asp, which say

{quote}
 the columns in each SELECT statement must be in the same order
{quote}

I haven't read the official SQL standard, but I am pretty sure this is not an issue.
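
A minimal Scala sketch of the positional semantics and the usual name-based workaround (assumes spark.implicits._ or sqlContext.implicits._ is in scope; data is illustrative):
{code}
import org.apache.spark.sql.functions.col

// unionAll resolves columns by position, not by name, so reorder first:
val a = Seq(("1", "x")).toDF("id", "tag")
val b = Seq(("y", "2")).toDF("tag", "id")

val byPosition = a.unionAll(b)                                  // columns matched by position
val byName     = a.unionAll(b.select(a.columns.map(col): _*))   // reorder b to match a first
{code}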


> unionAll returns wrong result when two dataframes has schema in different 
> order
> ---
>
> Key: SPARK-15918
> URL: https://issues.apache.org/jira/browse/SPARK-15918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: CentOS
>Reporter: Prabhu Joseph
>
> On applying the unionAll operation between DataFrames A and B, which have the same schema but with columns in a different order, the result has the column-to-value mapping changed.
> Repro:
> {code}
> A.show()
> +---++---+--+--+-++---+--+---+---+-+
> |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---++---+--+--+-++---+--+---+---+-+
> +---++---+--+--+-++---+--+---+---+-+
> B.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> +-+---+--+---+---+--+--+--+---+---+--++
> A = A.unionAll(B)
> A.show()
> +---+---+--+--+--+-++---+--+---+---+-+
> |tag|   year_day|   
> tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---+---+--+--+--+-++---+--+---+---+-+
> |  F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> +---+---+--+--+--+-++---+--+---+---+-+
> {code}
> After reordering the columns of A to match the schema of B, unionAll works fine:
> {code}
> C = 
> A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day")
> A = C.unionAll(B)
> A.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+

[jira] [Comment Edited] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328802#comment-15328802
 ] 

yuhao yang edited comment on SPARK-15930 at 6/14/16 2:46 AM:
-

That looks reasonable. +1. 

[~JohnDA] I would wait for one or two days to collect more opinions before 
sending a patch. You're welcome to send a pull request if you're interested.


was (Author: yuhaoyan):
That looks reasonable. +1. 

[~John Aherne] I would wait for one or two days to collect more opinions before 
sending a patch. You're welcome to send a pull request if you're interested.

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.
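
A hedged sketch of the extra pass this proposal would avoid, using the existing MLlib API; `transactions` stands for the training RDD of item arrays:
{code}
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
import org.apache.spark.rdd.RDD

def trainWithCount(transactions: RDD[Array[String]]): (FPGrowthModel[String], Long) = {
  val model = new FPGrowth().setMinSupport(0.2).setNumPartitions(4).run(transactions)
  // Today the record count requires a separate action over the input data,
  // even though training already scanned it:
  val numRecords = transactions.count()
  (model, numRecords)
}
{code}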






[jira] [Updated] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15808:
-
Assignee: Xiao Li

> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Example 1: PARQUET -> CSV
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> CSV
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> {noformat}
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
> {noformat}
> Error we got: 
> {noformat}
> Text data source supports only a single column, and you have 2 columns.
> {noformat}






[jira] [Resolved] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15808.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13546
[https://github.com/apache/spark/pull/13546]

> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> Example 1: PARQUET -> CSV
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> CSV
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> {noformat}
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
> {noformat}
> Error we got: 
> {noformat}
> Text data source supports only a single column, and you have 2 columns.
> {noformat}






[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328815#comment-15328815
 ] 

Reynold Xin commented on SPARK-15690:
-

Definitely no serialization/deserialization.


> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by partition id and then writes the data to disk. This is not a big bottleneck because the network throughput on commodity clusters tends to be low. However, an increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix sort to do data shuffling on a single node, and still gracefully fall back to disk if the data does not fit in memory. Given that the number of partitions is usually small (say, less than 256), it would require only a single pass to do the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This ticket has a smaller scope (single-process) and aims to actually productionize this code.
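
As a toy illustration of the single-pass, partition-id counting/radix pass the ticket alludes to (not Spark's shuffle code; `records` pairs a partition id below numPartitions with an opaque payload):
{code}
// Counting-sort-style grouping by partition id: one pass to count, one pass to place.
// Purely illustrative.
def groupByPartition(records: Array[(Int, Array[Byte])], numPartitions: Int)
    : Array[(Int, Array[Byte])] = {
  val counts = new Array[Int](numPartitions)
  records.foreach { case (pid, _) => counts(pid) += 1 }

  val offsets = counts.scanLeft(0)(_ + _)   // starting index of each partition's bucket
  val cursor  = offsets.clone()
  val out     = new Array[(Int, Array[Byte])](records.length)
  records.foreach { rec =>
    out(cursor(rec._1)) = rec
    cursor(rec._1) += 1
  }
  out
}
{code}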






[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer

2016-06-13 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328814#comment-15328814
 ] 

Egor Pahomov commented on SPARK-15934:
--

Sure, let me create a pull request tomorrow. I will test that everything works with all the tools I mentioned - Tableau, DataGrip, Squirrel.

> Return binary mode in ThriftServer
> --
>
> Key: SPARK-15934
> URL: https://issues.apache.org/jira/browse/SPARK-15934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095).
> This was a greatly irresponsible step, given that binary mode was the default in 1.6.1 and is now turned off in 2.0.0.
> Just to describe the magnitude of harm that not fixing this bug would do in my organization:
> * Tableau works only through the Thrift Server and only with the binary format. Tableau would not work with spark-2.0.0 at all!
> * I have a bunch of analysts in my organization with configured SQL clients (DataGrip and Squirrel). I would need to go one by one to change the connection string for them (DataGrip). Squirrel simply does not work with http - some jar hell in my case.
> * Let me not mention all the other systems that connect to our data infrastructure through the ThriftServer gateway.
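
For reference, a hedged sketch of the transport setting involved: hive.server2.transport.mode selects binary vs. http on the server side, and the client JDBC URLs differ accordingly. Host, ports, and credentials are placeholders, and the Hive JDBC driver is assumed to be on the classpath:
{code}
import java.sql.DriverManager

// Binary transport (the 1.6.x default) uses a plain host:port JDBC URL; with
// hive.server2.transport.mode=http, the URL needs transportMode/httpPath parameters.
val binaryUrl = "jdbc:hive2://thrift-host:10000/default"
val httpUrl   = "jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice"

val conn = DriverManager.getConnection(binaryUrl, "user", "")
{code}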






[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328802#comment-15328802
 ] 

yuhao yang commented on SPARK-15930:


That looks reasonable. +1. 

[~John Aherne] I would wait for one or two days to collect more opinions before 
sending a patch. You're welcome to send a pull request if you're interested.

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.






[jira] [Comment Edited] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread John Aherne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328786#comment-15328786
 ] 

John Aherne edited comment on SPARK-15930 at 6/14/16 2:04 AM:
--

The row count would be the number of rows in the dataset that was supplied to 
the train function.

Edited: I put in the wrong answer.


was (Author: johnda):
In your example, the row count would be 4. 

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.






[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread John Aherne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328786#comment-15328786
 ] 

John Aherne commented on SPARK-15930:
-

In your example, the row count would be 4. 

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.






[jira] [Comment Edited] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328759#comment-15328759
 ] 

yuhao yang edited comment on SPARK-15930 at 6/14/16 1:30 AM:
-

||   items|| freq||
|[27]|5|
|[27, 18]|2|
|[27, 18, 12]|2|
|[27, 18, 12, 17]|1|

Hi [~JohnDA], can you please specify what row count you expected for the model above? Is it the total number, 5 + 2 + 2 + 4?


was (Author: yuhaoyan):
||   items|| freq||
|[27]|5|
|[27, 18]|2|
|[27, 18, 12]|2|
|[27, 18, 12, 17]|4|

Hi [~JohnDA] can you please specify what's the row count you expected with the 
model above? is it the total number of 5 + 2 + 2 + 4?

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.






[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328759#comment-15328759
 ] 

yuhao yang commented on SPARK-15930:


||   items|| freq||
|[27]|5|
|[27, 18]|2|
|[27, 18, 12]|2|
|[27, 18, 12, 17]|4|

Hi [~JohnDA], can you please specify what row count you expected for the model above? Is it the total number, 5 + 2 + 2 + 4?

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.






[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer

2016-06-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328756#comment-15328756
 ] 

Reynold Xin commented on SPARK-15934:
-

[~epahomov] do you want to create a pr to revert the change?

> Return binary mode in ThriftServer
> --
>
> Key: SPARK-15934
> URL: https://issues.apache.org/jira/browse/SPARK-15934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095).
> This was a greatly irresponsible step, given that binary mode was the default in 1.6.1 and is now turned off in 2.0.0.
> Just to describe the magnitude of harm that not fixing this bug would do in my organization:
> * Tableau works only through the Thrift Server and only with the binary format. Tableau would not work with spark-2.0.0 at all!
> * I have a bunch of analysts in my organization with configured SQL clients (DataGrip and Squirrel). I would need to go one by one to change the connection string for them (DataGrip). Squirrel simply does not work with http - some jar hell in my case.
> * Let me not mention all the other systems that connect to our data infrastructure through the ThriftServer gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15935) Enable test for sql/streaming.py and fix these tests

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15935:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Enable test for sql/streaming.py and fix these tests
> 
>
> Key: SPARK-15935
> URL: https://issues.apache.org/jira/browse/SPARK-15935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Right now the tests in sql/streaming.py are disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15935) Enable test for sql/streaming.py and fix these tests

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15935:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Enable test for sql/streaming.py and fix these tests
> 
>
> Key: SPARK-15935
> URL: https://issues.apache.org/jira/browse/SPARK-15935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now the tests in sql/streaming.py are disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15935) Enable test for sql/streaming.py and fix these tests

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328736#comment-15328736
 ] 

Apache Spark commented on SPARK-15935:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/13655

> Enable test for sql/streaming.py and fix these tests
> 
>
> Key: SPARK-15935
> URL: https://issues.apache.org/jira/browse/SPARK-15935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now the tests in sql/streaming.py are disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15936) CLONE - Add class weights to Random Forest

2016-06-13 Thread Yuewei Na (JIRA)
Yuewei Na created SPARK-15936:
-

 Summary: CLONE - Add class weights to Random Forest
 Key: SPARK-15936
 URL: https://issues.apache.org/jira/browse/SPARK-15936
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.1
Reporter: Yuewei Na


Currently, this implementation of random forest does not support class weights. 
Class weights are important when there is imbalanced training data or the 
evaluation metric of a classifier is imbalanced (e.g. true positive rate at 
some false positive threshold). 
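
For illustration, a hedged sketch of one common weighting scheme (inverse class frequency) that such support might consume. The {{df}}, {{label}}, and {{weight}} names below are assumptions, and RandomForestClassifier does not currently accept a weight column; this only shows how per-record weights could be derived:

{code}
// Illustrative sketch only: df is assumed to be a DataFrame with a "label" column.
import org.apache.spark.sql.functions._

val labelCounts = df.groupBy("label").count()
val total = df.count()
val numClasses = labelCounts.count()

// weight(c) = total / (numClasses * count(c)) -- inverse-frequency weighting
val weights = labelCounts
  .withColumn("weight", lit(total) / (lit(numClasses) * col("count")))
  .select("label", "weight")

// Attach a per-record weight that a weight-aware forest could consume.
val weighted = df.join(weights, "label")
{code}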



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328723#comment-15328723
 ] 

Apache Spark commented on SPARK-15868:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/13654

> Executors table in Executors tab should sort Executor IDs in numerical order 
> (not alphabetical order)
> -
>
> Key: SPARK-15868
> URL: https://issues.apache.org/jira/browse/SPARK-15868
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: spark-webui-executors-sorting-2.png, 
> spark-webui-executors-sorting.png
>
>
> It _appears_ that the Executors table in the Executors tab sorts Executor IDs in 
> alphabetical order while it should sort them numerically. It does the sorting in a more 
> "friendly" way, yet the driver executor appears between 0 and 1?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15868:


Assignee: (was: Apache Spark)

> Executors table in Executors tab should sort Executor IDs in numerical order 
> (not alphabetical order)
> -
>
> Key: SPARK-15868
> URL: https://issues.apache.org/jira/browse/SPARK-15868
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: spark-webui-executors-sorting-2.png, 
> spark-webui-executors-sorting.png
>
>
> It _appears_ that the Executors table in the Executors tab sorts Executor IDs in 
> alphabetical order while it should sort them numerically. It does the sorting in a more 
> "friendly" way, yet the driver executor appears between 0 and 1?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15868:


Assignee: Apache Spark

> Executors table in Executors tab should sort Executor IDs in numerical order 
> (not alphabetical order)
> -
>
> Key: SPARK-15868
> URL: https://issues.apache.org/jira/browse/SPARK-15868
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
> Attachments: spark-webui-executors-sorting-2.png, 
> spark-webui-executors-sorting.png
>
>
> It _appears_ that the Executors table in the Executors tab sorts Executor IDs in 
> alphabetical order while it should sort them numerically. It does the sorting in a more 
> "friendly" way, yet the driver executor appears between 0 and 1?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15910:

Assignee: Sean Owen

> Schema is not checked when converting DataFrame to Dataset using Kryo encoder
> -
>
> Key: SPARK-15910
> URL: https://issues.apache.org/jira/browse/SPARK-15910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Owen
> Fix For: 2.0.0
>
>
> Here is the case to reproduce it:
> {code}
> scala> import org.apache.spark.sql.Encoders._
> scala> import org.apache.spark.sql.Encoders
> scala> import org.apache.spark.sql.Encoder
> scala> case class B(b: Int)
> scala> implicit val encoder = Encoders.kryo[B]
> encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]
> scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
> ds: org.apache.spark.sql.Dataset[B] = [value: binary]
> scala> ds.show()
> 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 45, Column 168: No applicable constructor/method found for actual parameters 
> "int"; candidates are: "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[], int, int)"
> ...
> {code}
> The expected behavior is to report the schema check failure earlier, when creating 
> the Dataset using {code}dataFrame.as[B]{code}
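
For contrast, a hedged sketch (not from the reporter's session; it assumes a Spark 2.0 shell where {{spark}} is the SparkSession) of conversions that avoid the schema mismatch:

{code}
case class B(b: Int)
import spark.implicits._

// Product encoder: the DataFrame schema (b: Int) matches the case class fields.
val ds1 = Seq(B(1)).toDF().as[B]

// Kryo encoder, but built directly from objects, so the schema is the single
// binary column that the encoder expects.
val ds2 = spark.createDataset(Seq(B(1)))(org.apache.spark.sql.Encoders.kryo[B])
{code}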



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15910:

Assignee: Sean Zhong  (was: Sean Owen)

> Schema is not checked when converting DataFrame to Dataset using Kryo encoder
> -
>
> Key: SPARK-15910
> URL: https://issues.apache.org/jira/browse/SPARK-15910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.0.0
>
>
> Here is the case to reproduce it:
> {code}
> scala> import org.apache.spark.sql.Encoders._
> scala> import org.apache.spark.sql.Encoders
> scala> import org.apache.spark.sql.Encoder
> scala> case class B(b: Int)
> scala> implicit val encoder = Encoders.kryo[B]
> encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]
> scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
> ds: org.apache.spark.sql.Dataset[B] = [value: binary]
> scala> ds.show()
> 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 45, Column 168: No applicable constructor/method found for actual parameters 
> "int"; candidates are: "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[], int, int)"
> ...
> {code}
> The expected behavior is to report the schema check failure earlier, when creating 
> the Dataset using {code}dataFrame.as[B]{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15910.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13632
[https://github.com/apache/spark/pull/13632]

> Schema is not checked when converting DataFrame to Dataset using Kryo encoder
> -
>
> Key: SPARK-15910
> URL: https://issues.apache.org/jira/browse/SPARK-15910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Fix For: 2.0.0
>
>
> Here is the case to reproduce it:
> {code}
> scala> import org.apache.spark.sql.Encoders._
> scala> import org.apache.spark.sql.Encoders
> scala> import org.apache.spark.sql.Encoder
> scala> case class B(b: Int)
> scala> implicit val encoder = Encoders.kryo[B]
> encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]
> scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
> ds: org.apache.spark.sql.Dataset[B] = [value: binary]
> scala> ds.show()
> 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 45, Column 168: No applicable constructor/method found for actual parameters 
> "int"; candidates are: "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[], int, int)"
> ...
> {code}
> The expected behavior is to report the schema check failure earlier, when creating 
> the Dataset using {code}dataFrame.as[B]{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15935) Enable test for sql/streaming.py and fix these tests

2016-06-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-15935:


 Summary: Enable test for sql/streaming.py and fix these tests
 Key: SPARK-15935
 URL: https://issues.apache.org/jira/browse/SPARK-15935
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now the tests in sql/streaming.py are disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328717#comment-15328717
 ] 

Shivaram Venkataraman commented on SPARK-15690:
---

Yeah, I don't think you'll see much improvement from avoiding the DAGScheduler. 
One more thing to try here is to avoid serialization / deserialization unless 
you are going to spill to disk. That'll save a lot of time inside a single node.

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fall back to disk 
> if the data size does not fit in memory. Given that the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.
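
As a rough illustration of the single-pass idea only (a sketch, not the proposed implementation; it assumes records already carry their target partition id and everything fits in memory):

{code}
// Counting sort keyed by partition id: one pass to count, one pass to place.
def sortByPartitionId(records: Array[(Int, Array[Byte])], numPartitions: Int): Array[(Int, Array[Byte])] = {
  val counts = new Array[Int](numPartitions)
  records.foreach { case (pid, _) => counts(pid) += 1 }

  // Prefix sums give each partition's starting offset in the output.
  val offsets = new Array[Int](numPartitions)
  var i = 1
  while (i < numPartitions) { offsets(i) = offsets(i - 1) + counts(i - 1); i += 1 }

  val out = new Array[(Int, Array[Byte])](records.length)
  val cursor = offsets.clone()
  records.foreach { case rec @ (pid, _) =>
    out(cursor(pid)) = rec
    cursor(pid) += 1
  }
  out
}
{code}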



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328707#comment-15328707
 ] 

Cody Koeninger commented on SPARK-12177:


I don't think waiting for 0.11 makes sense.



> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes
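
For reference, a minimal sketch of the new consumer API in question (a standalone Kafka client, not the Spark integration itself; broker, group, and topic names are placeholders):

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")   // placeholder
props.put("group.id", "example-group")           // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("example-topic"))

// Records are pulled rather than pushed, unlike the old high-level consumer.
val records = consumer.poll(1000)
records.asScala.foreach(r => println(s"${r.key()} -> ${r.value()}"))
consumer.close()
{code}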



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15934) Return binary mode in ThriftServer

2016-06-13 Thread Egor Pahomov (JIRA)
Egor Pahomov created SPARK-15934:


 Summary: Return binary mode in ThriftServer
 Key: SPARK-15934
 URL: https://issues.apache.org/jira/browse/SPARK-15934
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Egor Pahomov


In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). 
This was a greatly irresponsible step, given that binary mode was the default in 
1.6.1 and is now turned off in 2.0.0.

Just to describe the magnitude of harm that not fixing this bug would do in my 
organization:

* Tableau works only through the Thrift Server and only with the binary format. 
Tableau would not work with spark-2.0.0 at all!
* I have a bunch of analysts in my organization with configured SQL clients 
(DataGrip and Squirrel). I would need to go one by one to change the connection 
string for them (DataGrip); see the sketch after this list. Squirrel simply does 
not work with http - some jar hell in my case.
* let me not mention all the other stuff which connects to our data infrastructure 
through the ThriftServer as a gateway. 
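
For concreteness, a hedged sketch of the client-side difference between the two transports (the host, ports, and httpPath below are placeholders, not actual endpoints):

{code}
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")

// Binary transport (the 1.6.1 default):
val binaryConn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default")

// HTTP transport -- what every already-configured client would have to be switched to:
val httpConn = DriverManager.getConnection(
  "jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice")
{code}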




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-13 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328699#comment-15328699
 ] 

Mark Grover commented on SPARK-12177:
-

Hi Ismael and Cody,
My personal opinion was to hold off because a) the new consumer API was still 
marked as beta, so I wasn't sure of the compatibility guarantees, which 
Kafka did seem to break a little (as discussed 
[here|http://mail-archives.apache.org/mod_mbox/kafka-dev/201605.mbox/%3CCAKm=r7v5jgg9qxgjioczdph9vej57m46ngy_626kiq-ovdx...@mail.gmail.com%3E]),
 and b) the real benefit is security - I am personally a little more biased towards 
authentication (Kerberos) than encryption, so I was just waiting for delegation 
tokens to land. 

Now that 0.10.0 is released, there's a good chance delegation tokens will land in 
Kafka 0.11.0, and the new consumer API is marked stable, so I am more open to this 
PR being merged; it's been around for too long anyway. Cody, what do you say? Any 
reason you'd want to wait? If not, we can make a case for this going in now.

As far as the logistics of whether this belongs in Apache Bahir or not - today, I 
don't have a strong opinion on where the Kafka integration should reside. What I do 
feel strongly about, like Cody said, is that the old consumer API integration and 
the new consumer API integration should reside in the same place. Since the old 
integration is in Spark, that's where the new one makes sense. If a vote on Apache 
Spark results in the Kafka integration being taken out, both the new and the old 
residing in Apache Bahir would make sense.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable

2016-06-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15929.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13649
[https://github.com/apache/spark/pull/13649]

> DataFrameSuite path globbing error message tests are not fully portable
> ---
>
> Key: SPARK-15929
> URL: https://issues.apache.org/jira/browse/SPARK-15929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> The DataFrameSuite regression tests for SPARK-13774 fail in my environment 
> because they attempt to glob over all of {{/mnt}}, and some of the 
> subdirectories in there have restrictive permissions which cause the test to 
> fail. I think we should rewrite this test to not depend on existing OS paths.
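
One possible shape for a portable version, sketched here only as an illustration (this is not necessarily what the fix in the pull request does; it assumes a {{spark}} session available in the test):

{code}
import java.nio.file.Files

// Glob over a temp directory the test owns instead of real OS paths such as /mnt.
val dir = Files.createTempDirectory("spark-glob-test").toFile
try {
  // A glob that matches nothing inside the owned directory still exercises the
  // path-does-not-exist error message without touching permission-restricted dirs.
  spark.read.format("text").load(dir.getAbsolutePath + "/*/*")
} finally {
  dir.delete()
}
{code}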



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328648#comment-15328648
 ] 

Shixiong Zhu commented on SPARK-15905:
--

The last time I encountered a FileOutputStream.writeBytes hang, it was because I 
created a Process in Java but didn't consume its input stream and error stream. 
Eventually, the underlying buffer filled up and blocked the Process.
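
A minimal sketch of that Process pattern and how to avoid it (plain JDK APIs, unrelated to Spark internals; the command is a placeholder):

{code}
import scala.io.Source

// If the child's output is never consumed, its pipe buffer fills up and writes
// on the other side of the pipe block.
val pb = new ProcessBuilder("some-command")   // placeholder command
pb.redirectErrorStream(true)                  // merge stderr into stdout
val proc = pb.start()

// Drain the output so nothing sharing that pipe ever blocks on a full buffer.
val drainer = new Thread(new Runnable {
  def run(): Unit = Source.fromInputStream(proc.getInputStream).getLines().foreach(_ => ())
})
drainer.setDaemon(true)
drainer.start()
proc.waitFor()
{code}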

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328640#comment-15328640
 ] 

Shixiong Zhu commented on SPARK-15905:
--

By the way, how did you use Spark? Did you just run it or call it via some 
Process APIs?

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328638#comment-15328638
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Oh, the thread state is `RUNNABLE`, so it's not a deadlock. Could you check your 
disk? Maybe some bad disks are causing the hang.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328627#comment-15328627
 ] 

Shixiong Zhu edited comment on SPARK-15905 at 6/13/16 11:42 PM:


Do you have the whole jstack output? I guess some place holds the lock on 
`System.err`, but I need the whole output for all threads to find that place.


was (Author: zsxwing):
Do you have the whole jstack output? I guess some places holds the lock of 
`System.err` but needs the whole output for all threads to find the place.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328627#comment-15328627
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Do you have the whole jstack output? I guess some places holds the lock of 
`System.err` but needs the whole output for all threads to find the place.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15933:


Assignee: Tathagata Das  (was: Apache Spark)

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, DataFrameReader/Writer has methods that are needed for both 
> streaming and non-streaming DFs. This is quite awkward because each method in 
> them throws a runtime exception for one case or the other. So rather than having 
> half the methods throw runtime exceptions, it's better to have a 
> different reader/writer API for streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15933:


Assignee: Apache Spark  (was: Tathagata Das)

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Currently, DataFrameReader/Writer has methods that are needed for both 
> streaming and non-streaming DFs. This is quite awkward because each method in 
> them throws a runtime exception for one case or the other. So rather than having 
> half the methods throw runtime exceptions, it's better to have a 
> different reader/writer API for streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328609#comment-15328609
 ] 

Apache Spark commented on SPARK-15933:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13653

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, DataFrameReader/Writer has methods that are needed for both 
> streaming and non-streaming DFs. This is quite awkward because each method in 
> them throws a runtime exception for one case or the other. So rather than having 
> half the methods throw runtime exceptions, it's better to have a 
> different reader/writer API for streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15933:
-

 Summary: Refactor reader-writer interface for streaming DFs to use 
DataStreamReader/Writer
 Key: SPARK-15933
 URL: https://issues.apache.org/jira/browse/SPARK-15933
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, DataFrameReader/Writer has methods that are needed for both streaming 
and non-streaming DFs. This is quite awkward because each method in them throws a 
runtime exception for one case or the other. So rather than having half the methods 
throw runtime exceptions, it's better to have a different reader/writer API for 
streams.
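
For illustration, a sketch of what the separated streaming API could look like (the method names are taken as a proposal, not a settled API at this point; the socket source and console sink options are placeholders):

{code}
// Streaming-only reader: no batch methods to throw at runtime.
val streamingDf = spark
  .readStream
  .format("socket")
  .option("host", "localhost")   // placeholder source options
  .option("port", "9999")
  .load()

// Streaming-only writer: start() returns a running query handle.
val query = streamingDf
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
{code}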



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328608#comment-15328608
 ] 

Tejas Patil commented on SPARK-15905:
-

Another instance but this time not via console progress bar. This job has been 
stuck for 15+ hours.

{noformat}
"dispatcher-event-loop-23" #60 daemon prio=5 os_prio=0 tid=0x7f981e206000 
nid=0x685f8 runnable [0x7f8c0f1ef000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
- locked <0x7f8d48167058> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:480)
- locked <0x7f8d48167020> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
- locked <0x7f8d48237680> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
- locked <0x7f8d48235ee0> (a org.apache.log4j.ConsoleAppender)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
- locked <0x7f8d481bf1e8> (a org.apache.log4j.spi.RootLogger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:400)
at org.apache.spark.Logging$class.logWarning(Logging.scala:70)
at 
org.apache.spark.scheduler.TaskSetManager.logWarning(TaskSetManager.scala:52)
at 
org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:721)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$6.apply(TaskSetManager.scala:813)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$6.apply(TaskSetManager.scala:807)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:807)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:536)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:474)
- locked <0x7f8d5850e1e0> (a 
org.apache.spark.scheduler.TaskSchedulerImpl)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:263)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:202)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:202)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:202)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.sp

[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328597#comment-15328597
 ] 

Shixiong Zhu commented on SPARK-15905:
--

[~tejasp] Probably some deadlock in Spark. It would be great if you can provide 
the full jstack output.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328592#comment-15328592
 ] 

Manoj Kumar commented on SPARK-3155:


I would like to add support for pruning DecisionTrees as part of my internship.

Some API-related questions:

Support for DecisionTree pruning in R is done this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel; it would stop pruning when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?
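
To make the proposal concrete, a hedged sketch on a toy tree structure (the node classes and the pruning rule below are hypothetical illustrations, not existing MLlib APIs):

{code}
// error = validation error of this node's own prediction.
sealed trait Node { def error: Double }
case class Leaf(error: Double) extends Node
case class Internal(error: Double, left: Node, right: Node) extends Node

// Collapse children bottom-up whenever splitting improves validation error
// by no more than errorTol, i.e. err - (left.error + right.error) <= errorTol.
def prune(node: Node, errorTol: Double): Node = node match {
  case Internal(err, l, r) =>
    (prune(l, errorTol), prune(r, errorTol)) match {
      case (pl: Leaf, pr: Leaf) if pl.error + pr.error >= err - errorTol => Leaf(err)
      case (pl, pr) => Internal(err, pl, pr)
    }
  case leaf => leaf
}
{code}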


> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on 
> the validation set made by the leafs’ predictions with the error made by the 
> parent’s predictions.  Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328591#comment-15328591
 ] 

Tejas Patil commented on SPARK-15905:
-

[~zsxwing] : This does not repro consistently but happens in one-off cases, and 
over different jobs. I have seen this 3-4 times in the last week. The jobs I was 
running were pure SQL queries with SELECT, JOINs and GROUP BY. Sorry, I cannot 
share the exact query nor the data, but I am quite positive that this problem 
has nothing to do with the query being run.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. Looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-3155:
---
Comment: was deleted

(was: I would like to add support for pruning DecisionTrees as part of my 
internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension would be to start would be to:

model.prune(validationData, errorTol=)

where model is a fit DecisionTreeRegressionModel would stop pruning when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?
)

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on 
> the validation set made by the leafs’ predictions with the error made by the 
> parent’s predictions.  Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15932:


Assignee: Wenchen Fan  (was: Apache Spark)

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15932:


Assignee: Apache Spark  (was: Wenchen Fan)

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328586#comment-15328586
 ] 

Apache Spark commented on SPARK-15932:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13648

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Description: 
We removed some deprecated methods from SQLContext in the Spark 2.0 branch.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated methods may be used by existing third-party data sources. We 
probably want to add them back to retain source-code-level backward 
compatibility. 

  was:
We removed some deprecated methods from SQLContext in the Spark 2.0 branch.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated methods may be used by existing third-party data sources. We 
probably want to add them back to retain backward compatibility. 


> Add deprecated method back to SQLContext for source code backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated methods from SQLContext in the Spark 2.0 branch.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to retain source-code-level backward 
> compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15932:
---

 Summary: document the contract of encoder serializer expressions
 Key: SPARK-15932
 URL: https://issues.apache.org/jira/browse/SPARK-15932
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Summary: Add deprecated method back to SQLContext for source code backward 
compatiblity  (was: Add deprecated method back to SQLContext for backward 
compatiblity)

> Add deprecated method back to SQLContext for source code backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated methods from SQLContext in the Spark 2.0 branch.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to retain backward compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328487#comment-15328487
 ] 

Manoj Kumar commented on SPARK-3155:


I would like to add support for pruning DecisionTrees as part of my internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel; pruning would stop when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?
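
For illustration only, here is a minimal, self-contained sketch of the reduced-error pruning idea from the issue description (plain Scala, not MLlib code; the Node/Leaf/Internal types and the prune/error helpers are made up for this sketch). It routes validation examples down the tree and collapses an internal node into a leaf whenever the leaf's validation error is no worse:

{code}
// Toy tree types for illustration only; these are not MLlib classes.
sealed trait Node { def prediction: Double }
case class Leaf(prediction: Double) extends Node
case class Internal(prediction: Double, left: Node, right: Node,
                    goLeft: Double => Boolean) extends Node

def predict(node: Node, x: Double): Double = node match {
  case Leaf(p) => p
  case Internal(_, l, r, goLeft) => if (goLeft(x)) predict(l, x) else predict(r, x)
}

// Squared error of a subtree on the (feature, label) validation pairs routed to it.
def error(node: Node, validation: Seq[(Double, Double)]): Double =
  validation.map { case (x, y) => val d = predict(node, x) - y; d * d }.sum

// Bottom-up reduced-error pruning: collapse an internal node into a leaf
// (using the node's own training-time prediction) when that does not increase
// the validation error, as in steps (3)-(4) of the description below.
def prune(node: Node, validation: Seq[(Double, Double)]): Node = node match {
  case leaf: Leaf => leaf
  case Internal(p, l, r, goLeft) =>
    val (toLeft, toRight) = validation.partition { case (x, _) => goLeft(x) }
    val prunedChildren = Internal(p, prune(l, toLeft), prune(r, toRight), goLeft)
    val asLeaf = Leaf(p)
    if (error(asLeaf, validation) <= error(prunedChildren, validation)) asLeaf
    else prunedChildren
}
{code}

An errorTol knob like the one proposed above would simply change the comparison so that a node is kept only when its children improve the validation error by more than the tolerance.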


> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves’ predictions with the error made by the 
> parent’s predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR

2016-06-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15925.
---
Resolution: Fixed

Issue resolved by pull request 13644
[https://github.com/apache/spark/pull/13644]

> Replaces registerTempTable with createOrReplaceTempView in SparkR
> -
>
> Key: SPARK-15925
> URL: https://issues.apache.org/jira/browse/SPARK-15925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion

2016-06-13 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328466#comment-15328466
 ] 

Kay Ousterhout commented on SPARK-15176:


I thought about this a little more and I think I'm in favor of maxShare instead 
of maxRunningTasks.  The reason is that maxRunningTasks seems brittle to the 
underlying setup -- if someone configures a certain maximum number of tasks, 
and then a few machines die, the maximum may no longer be reasonable (e.g., it 
may become larger than the number of machines in the cluster).  The other 
benefit is symmetry with minShare, as Mark mentioned.

[~njw45] why did you choose maxRunningTasks, as opposed to maxShare?  Are there 
other reasons that maxRunningTasks makes more sense?

> Job Scheduling Within Application Suffers from Priority Inversion
> -
>
> Key: SPARK-15176
> URL: https://issues.apache.org/jira/browse/SPARK-15176
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Nick White
>
> Say I have two pools, and N cores in my cluster:
> * I submit a job to one, which has M >> N tasks
> * N of the M tasks are scheduled
> * I submit a job to the second pool - but none of its tasks get scheduled 
> until a task from the other pool finishes!
> This can lead to unbounded denial-of-service for the second pool - regardless 
> of `minShare` or `weight` settings. Ideally Spark would support a pre-emption 
> mechanism, or an upper bound on a pool's resource usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328432#comment-15328432
 ] 

Shivaram Venkataraman commented on SPARK-15931:
---

cc [~felixcheung] We should print out the names of the methods in expected vs. 
actual, as this has failed before as well

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15776) Type coercion incorrect

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328420#comment-15328420
 ] 

Apache Spark commented on SPARK-15776:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/13651

> Type coercion incorrect
> ---
>
> Key: SPARK-15776
> URL: https://issues.apache.org/jira/browse/SPARK-15776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark based on commit 
> 26c1089c37149061f838129bb53330ded68ff4c9
>Reporter: Weizhong
>Priority: Minor
>
> {code:sql}
> CREATE TABLE cdr (
>   debet_dt  int  ,
>   srv_typ_cdstring   ,
>   b_brnd_cd smallint ,
>   call_dur  int
> )
> ROW FORMAT delimited fields terminated by ','
> STORED AS TEXTFILE;
> {code}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 
> END): bigint
> Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) 
> ELSE 0 END)#27L]
> +- Sort [debet_dt#16 ASC], true
>+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 
> LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 
> as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur 
> / 60) ELSE 0 END)#27L]
>   +- MetastoreRelation default, cdr
> {noformat}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) 
> THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) 
> END): double
> Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS 
> INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS 
> DOUBLE) END)#87]
> +- Sort [debet_dt#76 ASC], true
>+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 
> as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as 
> double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) 
> IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) 
> ELSE CAST(0 AS DOUBLE) END)#87]
>   +- MetastoreRelation default, cdr
> {noformat}
> The only difference is the WHEN condition, but it results in different output 
> column types (one is bigint, one is double). 
> We need to apply "Division" before "FunctionArgumentConversion", like below:
> {code:java}
> val typeCoercionRules =
> PropagateTypes ::
>   InConversion ::
>   WidenSetOperationTypes ::
>   PromoteStrings ::
>   DecimalPrecision ::
>   BooleanEquality ::
>   StringToIntegralCasts ::
>   Division ::
>   FunctionArgumentConversion ::
>   CaseWhenCoercion ::
>   IfCoercion ::
>   PropagateTypes ::
>   ImplicitTypeCasts ::
>   DateTimeOperations ::
>   Nil
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328417#comment-15328417
 ] 

Bryan Cutler commented on SPARK-15861:
--

{{mapPartitions}} will expect the function to return a sequence; is that what 
you are referring to?

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function, which did return a value, act as if 
> the end user had called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-06-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328405#comment-15328405
 ] 

Dongjoon Hyun commented on SPARK-15918:
---

Hi, [~Prabhu Joseph].
Instead of changing one of the tables, you just need to use an explicit `select`.

If `df1(a,b)` and `df2(b,a)`, please do the following.
{code}
df1.union(df2.select("a", "b"))
{code}

IMHO, this is not a problem.
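
As a small illustration of the same idea, the reordering can be wrapped in a helper so callers do not have to spell out every column. The helper name below is made up for this sketch; it is not a built-in API in these Spark versions:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not a Spark API: reorder b's columns to match a's
// schema order, then perform the positional unionAll.
def unionByColumnName(a: DataFrame, b: DataFrame): DataFrame =
  a.unionAll(b.select(a.columns.map(col): _*))
{code}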

> unionAll returns wrong result when two dataframes has schema in different 
> order
> ---
>
> Key: SPARK-15918
> URL: https://issues.apache.org/jira/browse/SPARK-15918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: CentOS
>Reporter: Prabhu Joseph
>
> On applying the unionAll operation between the A and B dataframes, which both have 
> the same schema but in a different order, the result has the column-to-value 
> mapping changed.
> Repro:
> {code}
> A.show()
> +---++---+--+--+-++---+--+---+---+-+
> |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---++---+--+--+-++---+--+---+---+-+
> +---++---+--+--+-++---+--+---+---+-+
> B.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> +-+---+--+---+---+--+--+--+---+---+--++
> A = A.unionAll(B)
> A.show()
> +---+---+--+--+--+-++---+--+---+---+-+
> |tag|   year_day|   
> tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---+---+--+--+--+-++---+--+---+---+-+
> |  F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> +---+---+--+--+--+-++---+--+---+---+-+
> {code}
> After changing the schema of A to match B, unionAll works fine
> {code}
> C = 
> A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day")
> A = C.unionAll(B)
> A.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNH

[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328403#comment-15328403
 ] 

Cheng Lian commented on SPARK-15931:


cc [~mengxr]

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-15931:
--

 Summary: SparkR tests failing on R 3.3.0
 Key: SPARK-15931
 URL: https://issues.apache.org/jira/browse/SPARK-15931
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Cheng Lian


Environment:

# Spark master Git revision: 
[f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
# R version: 3.3.0

To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
Relevant log lines:
{noformat}
...
Failed -
1. Failure: Check masked functions (@test_context.R#44) 
length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
1/1 mismatches
[1] 3 - 5 == -2


2. Failure: Check masked functions (@test_context.R#45) 
sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
Lengths differ: 3 vs 5
...
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328390#comment-15328390
 ] 

Reynold Xin commented on SPARK-15690:
-

Yes, there is definitely no reason to go through the network for a single process. 
Technically we can even bypass the entire DAGScheduler, although that might be 
too much work.


> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When a single node is operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.
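
As a rough illustration of the single-pass idea (plain Scala, not Spark code; the record representation and names are made up), a counting-sort-style pass over partition ids could look like this, assuming partition ids fall in [0, numPartitions):

{code}
// Toy sketch: single-pass counting sort of (partitionId, payload) records,
// standing in for the in-memory radix sort idea described above.
def sortByPartition(records: Array[(Int, Array[Byte])],
                    numPartitions: Int): Array[(Int, Array[Byte])] = {
  val counts = new Array[Int](numPartitions + 1)
  records.foreach { case (pid, _) => counts(pid + 1) += 1 }       // histogram per partition
  for (i <- 1 to numPartitions) counts(i) += counts(i - 1)        // prefix sums give start offsets
  val out = new Array[(Int, Array[Byte])](records.length)
  val cursor = counts.clone()
  records.foreach { rec =>
    out(cursor(rec._1)) = rec                                     // place record in its partition's slot
    cursor(rec._1) += 1
  }
  out
}
{code}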



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15887.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13611
[https://github.com/apache/spark/pull/13611]

> Bring back the hive-site.xml support for Spark 2.0
> --
>
> Key: SPARK-15887
> URL: https://issues.apache.org/jira/browse/SPARK-15887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, 
> it seems to make sense to still load this conf file.
> Originally, this file was loaded when we loaded the HiveConf class, and all settings 
> could be retrieved after we created a HiveConf instance. Let's avoid using 
> this way to load hive-site.xml. Instead, since hive-site.xml is a normal 
> Hadoop conf file, we can first find its URL using the classloader and then 
> use Hadoop Configuration's addResource (or add hive-site.xml as a default 
> resource through Configuration.addDefaultResource) to load confs.
> Please note that hive-site.xml needs to be loaded into the hadoop conf used 
> to create metadataHive.
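
A minimal sketch of the loading approach the description outlines, assuming a hive-site.xml is present on the classpath (illustrative only, not the actual change from the pull request):

{code}
import org.apache.hadoop.conf.Configuration

// Locate hive-site.xml via the classloader and add it to a Hadoop Configuration,
// as the description suggests.
val hadoopConf = new Configuration()
val hiveSiteUrl = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
if (hiveSiteUrl != null) {
  hadoopConf.addResource(hiveSiteUrl)   // settings become visible to this conf
}
{code}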



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15753) Move some Analyzer stuff to Analyzer from DataFrameWriter

2016-06-13 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328381#comment-15328381
 ] 

Wenchen Fan commented on SPARK-15753:
-

This is reverted; see the discussion at 
https://github.com/apache/spark/pull/13496#discussion_r66724862

> Move some Analyzer stuff to Analyzer from DataFrameWriter
> -
>
> Key: SPARK-15753
> URL: https://issues.apache.org/jira/browse/SPARK-15753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> DataFrameWriter.insertInto includes some Analyzer stuff. We should move it to 
> Analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Greg Bowyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328380#comment-15328380
 ] 

Greg Bowyer commented on SPARK-15861:
-

... Hmm, from my end-user testing it does not seem to fail if the map function 
does not return a valid sequence.

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function, which did return a value, act as if 
> the end user had called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328378#comment-15328378
 ] 

Saisai Shao commented on SPARK-15690:
-

I see. Since everything is in a single process, it looks like the Netty layer could 
be bypassed and the memory blocks fetched directly on the reader side. It should 
definitely be faster than the current implementation.

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When a single node is operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9623:
---

Assignee: Apache Spark

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread John Aherne (JIRA)
John Aherne created SPARK-15930:
---

 Summary: Add Row count property to FPGrowth model
 Key: SPARK-15930
 URL: https://issues.apache.org/jira/browse/SPARK-15930
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.1
Reporter: John Aherne
Priority: Minor


Add a row count property to MLlib's FPGrowth model. 

When using the model from FPGrowth, a count of the total number of records is 
often necessary. 

It appears that the function already calculates that value when training the 
model, so it would save time not having to do it again outside the model. 

Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
Spark, and making suggestions.
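
Until such a property exists, a workaround sketch (names here are illustrative) is to cache the transactions and count them alongside training, so the data is only materialized once:

{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Cache the input, count the records, and train on the same cached RDD.
def trainWithCount(transactions: RDD[Array[String]], minSupport: Double) = {
  transactions.cache()
  val rowCount = transactions.count()                        // total number of records
  val model = new FPGrowth().setMinSupport(minSupport).run(transactions)
  (model, rowCount)
}
{code}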



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9623:
---

Assignee: (was: Apache Spark)

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328343#comment-15328343
 ] 

Apache Spark commented on SPARK-9623:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/13650

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328329#comment-15328329
 ] 

Apache Spark commented on SPARK-15929:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13649

> DataFrameSuite path globbing error message tests are not fully portable
> ---
>
> Key: SPARK-15929
> URL: https://issues.apache.org/jira/browse/SPARK-15929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The DataFrameSuite regression tests for SPARK-13774 fail in my environment 
> because they attempt to glob over all of {{/mnt}} and some of the 
> subdirectories in there have restrictive permissions which cause the test to 
> fail. I think we should rewrite this test to not depend on existing OS paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable

2016-06-13 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-15929:
--

 Summary: DataFrameSuite path globbing error message tests are not 
fully portable
 Key: SPARK-15929
 URL: https://issues.apache.org/jira/browse/SPARK-15929
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


The DataFrameSuite regression tests for SPARK-13774 fail in my environment 
because they attempt to glob over all of {{/mnt}} and some of the 
subdirectories in there have restrictive permissions which cause the test to 
fail. I think we should rewrite this test to not depend on existing OS paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-15928) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout deleted SPARK-15928:
---


> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15928
> URL: https://issues.apache.org/jira/browse/SPARK-15928
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.
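
For illustration, a shared traversal helper could look roughly like this (plain Scala against public RDD APIs; the name and shape are hypothetical, not the actual DAGScheduler refactoring):

{code}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Hypothetical shared helper: visit every ancestor RDD of `root` exactly once,
// so callers only supply what they do at each node (e.g. collect shuffle deps).
def visitRddAncestors(root: RDD[_])(visit: RDD[_] => Unit): Unit = {
  val visited = mutable.HashSet[RDD[_]]()
  val waiting = mutable.Stack[RDD[_]](root)
  while (waiting.nonEmpty) {
    val rdd = waiting.pop()
    if (!visited(rdd)) {
      visited += rdd
      visit(rdd)
      rdd.dependencies.foreach(dep => waiting.push(dep.rdd))
    }
  }
}
{code}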



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328266#comment-15328266
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Do you have a reproducer? What does your code look like?

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328264#comment-15328264
 ] 

Reynold Xin commented on SPARK-15690:
-

Yup. Eventually we can also generalize this to multiple processes (e.g., a cluster).

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When a single node is operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245
 ] 

Bryan Cutler edited comment on SPARK-15861 at 6/13/16 9:05 PM:
---

[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function that takes an iterator as input then 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is your function will map the 
iterator to a numpy array, that internally will be something like  
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition, then {{collect}} will iterate over that sequence and return each 
element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 
3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on..

I believe this is working as it is supposed to, and in general, 
{{mapPartitions}} will not usually give the same result as {{map}} - it will 
fail if the function does not return a valid sequence.  The documentation could 
perhaps be a little clearer in that regard.


was (Author: bryanc):
[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function the takes an iterator as input then 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is your function will map the 
iterator to a numpy array, that internally will be something like  
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition, then {{collect}} will iterate over that sequence and return each 
element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 
3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on..

I believe this is working as it is supposed to, and in general, 
{{mapPartitions}} will not usually give the same result as {{map}} - it will 
fail if the function does not return a valid sequence.  The documentation could 
perhaps be a little clearer in that regard.

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function, which did return a value, act as if 
> the end user had called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5374) abstract RDD's DAG graph iteration in DAGScheduler

2016-06-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-5374.
-
Resolution: Duplicate

Closing this because it duplicates the more narrowly-scoped JIRAs linked above.

> abstract RDD's DAG graph iteration in DAGScheduler
> --
>
> Key: SPARK-5374
> URL: https://issues.apache.org/jira/browse/SPARK-5374
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Wenchen Fan
>
> DAGScheduler has many methods that iterate over an RDD's DAG graph; we should 
> abstract the iteration process to reduce code size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328255#comment-15328255
 ] 

Saisai Shao commented on SPARK-15690:
-

Hi [~rxin], what's the meaning of "single-process"? Is that referring to 
something similar to local mode? 

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When a single node is operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245
 ] 

Bryan Cutler commented on SPARK-15861:
--

[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function that takes an iterator as input then 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is your function will map the 
iterator to a numpy array, that internally will be something like  
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition, then {{collect}} will iterate over that sequence and return each 
element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 
3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on..

I believe this is working as it is supposed to, and in general, 
{{mapPartitions}} will not usually give the same result as {{map}} - it will 
fail if the function does not return a valid sequence.  The documentation could 
perhaps be a little clearer in that regard.

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function, which did return a value, act as if 
> the end user had called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15889.
--
Resolution: Fixed

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-15889:
-
Fix Version/s: 2.0.0

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15530:
-
Assignee: Takeshi Yamamuro

> Partitioning discovery logic HadoopFsRelation should use a higher setting of 
> parallelism
> 
>
> Key: SPARK-15530
> URL: https://issues.apache.org/jira/browse/SPARK-15530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418,
>  we launch a Spark job to list files in parallel in order to discover 
> partitions. However, we do not set the number of partitions here, which 
> means that we use the default parallelism of the cluster. It is better 
> to set the number of partitions explicitly to generate smaller tasks, which 
> helps load balancing. 
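
The gist of the change, sketched at the PySpark level for illustration (the
real fix is in the Scala listing code linked above; {{paths}}, the parallelism
figure, and the listing helper are assumptions of the sketch, and {{sc}} is an
active SparkContext):

{code}
import os

def list_leaf_files(path):
    # Stand-in for the real distributed file listing; assumption of the sketch.
    return [os.path.join(path, name) for name in os.listdir(path)]

# sc: an active SparkContext (assumed)
paths = ["/data/table/part=1", "/data/table/part=2"]  # illustrative input
num_parallelism = min(len(paths), 10000)              # chosen explicitly

leaf_files = (
    sc.parallelize(paths, num_parallelism)  # numSlices given, not defaulted
      .flatMap(list_leaf_files)
      .collect()
)
{code}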



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15530.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13444
[https://github.com/apache/spark/pull/13444]

> Partitioning discovery logic HadoopFsRelation should use a higher setting of 
> parallelism
> 
>
> Key: SPARK-15530
> URL: https://issues.apache.org/jira/browse/SPARK-15530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 2.0.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418,
>  we launch a Spark job to list files in parallel in order to discover 
> partitions. However, we do not set the number of partitions here, which 
> means that we use the default parallelism of the cluster. It is better 
> to set the number of partitions explicitly to generate smaller tasks, which 
> helps load balancing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328205#comment-15328205
 ] 

Xuan Wang commented on SPARK-15924:
---

I then realized that this is not a problem with SparkR, so I closed the issue. 
Thanks!

> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}
> After I remove the backslash in {{date_format("%m/%d\n%a")}}, it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328200#comment-15328200
 ] 

Herman van Hovell commented on SPARK-15822:
---

[~robbinspg] Could you try this without caching?

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with a segmentation violation while running an application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}
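
For reference, the two off-heap settings quoted at the top of the description
can be set as below when trying to reproduce (a sketch of the session setup
only; the application workload itself is not part of this report, and the app
name is illustrative):

{code}
from pyspark.sql import SparkSession

# Sketch of a reproduction session with the settings from the description.
spark = (
    SparkSession.builder
    .appName("SPARK-15822-repro")  # illustrative name
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "512m")
    .getOrCreate()
)
{code}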



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15676) Disallow Column Names as Partition Columns For Hive Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15676.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13415
[https://github.com/apache/spark/pull/13415]

> Disallow Column Names as Partition Columns For Hive Tables
> --
>
> Key: SPARK-15676
> URL: https://issues.apache.org/jira/browse/SPARK-15676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> Below is a common mistake users might make:
> {noformat}
> hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data 
> string, part string);
> FAILED: SemanticException [Error 10035]: Column repeated in partitioning 
> columns
> {noformat}
> Unlike the message Hive returns, Spark currently returns a confusing error 
> message:
> {noformat}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For 
> direct MetaStore DB connections, we don't support retries at the client 
> level.);
> {noformat}
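
The same mistake issued from PySpark, for illustration (a sketch only; the DDL
is the one quoted above, and {{spark}} is assumed to be an active
SparkSession):

{code}
# spark: an active SparkSession (assumed).
# `data` appears both as a table column and as a partition column; the ask is
# that Spark reject this with a clear message instead of surfacing the
# MetaStore retry error quoted above.
spark.sql("""
  CREATE TABLE partitioned (id bigint, data string)
  PARTITIONED BY (data string, part string)
""")
{code}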



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15676) Disallow Column Names as Partition Columns For Hive Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15676:
-
Assignee: Xiao Li

> Disallow Column Names as Partition Columns For Hive Tables
> --
>
> Key: SPARK-15676
> URL: https://issues.apache.org/jira/browse/SPARK-15676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Below is a common mistake users might make:
> {noformat}
> hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data 
> string, part string);
> FAILED: SemanticException [Error 10035]: Column repeated in partitioning 
> columns
> {noformat}
> Unlike the message Hive returns, Spark currently returns a confusing error 
> message:
> {noformat}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For 
> direct MetaStore DB connections, we don't support retries at the client 
> level.);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


