[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2017-03-07 Thread Cui Xixin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900871#comment-15900871
 ] 

Cui Xixin commented on SPARK-6951:
--

In my case, the .inprogress files are the main reason for the slow startup. Are 
the .inprogress files actually necessary for the history server, or could they 
be cleaned up according to 'spark.history.fs.cleaner.maxAge'?
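
For reference, the history server cleaner is driven by settings along these 
lines (spark-defaults.conf style; the values are only examples, and whether the 
cleaner should also remove .inprogress files is exactly the open question 
above):

{code}
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    7d
{code}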

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.






[jira] [Comment Edited] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

2017-03-07 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900821#comment-15900821
 ] 

Saisai Shao edited comment on SPARK-19812 at 3/8/17 7:42 AM:
-

[~tgraves], I'm not quite sure what you mean here?

bq. The tests are using files rather then directories so it didn't catch. We 
need to fix the test also.

From my understanding, this issue happens when the destination directory is not 
empty and we try to move with REPLACE_EXISTING. It can also happen when the 
rename call fails and the source directory is not an empty directory.

But I cannot see how that happened here; from the log it looks more like a 
rename failure, since the path in the exception points to the source directory.



was (Author: jerryshao):
[~tgraves], I'm not quite sure what you mean here?

bq. The tests are using files rather then directories so it didn't catch. We 
need to fix the test also.

From my understanding, this issue happens when the destination directory is not 
empty and we try to move with REPLACE_EXISTING. It can also happen when the 
rename call fails and the source directory is not an empty directory.

But I cannot see how that happened here, because if the destination directory 
were not empty, we should have returned earlier and never gone on to check the 
old NM local dirs.



> YARN shuffle service fails to relocate recovery DB directories
> --
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The yarn shuffle service tries to switch from the yarn local directories to 
> the real recovery directory but can fail to move the existing recovery db's.  
> It fails due to Files.move not doing directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo and we switched it in the pr due to review 
> comments and it looks like didn't do a final real test. The tests are using 
> files rather then directories so it didn't catch. We need to fix the test 
> also.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440






[jira] [Comment Edited] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

2017-03-07 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900821#comment-15900821
 ] 

Saisai Shao edited comment on SPARK-19812 at 3/8/17 7:40 AM:
-

[~tgraves], I'm not quite sure what you mean here?

bq. The tests are using files rather then directories so it didn't catch. We 
need to fix the test also.

From my understanding, this issue happens when the destination directory is not 
empty and we try to move with REPLACE_EXISTING. It can also happen when the 
rename call fails and the source directory is not an empty directory.

But I cannot see how that happened here, because if the destination directory 
were not empty, we should have returned earlier and never gone on to check the 
old NM local dirs.




was (Author: jerryshao):
[~tgraves], I'm not quite sure what you mean here?

bq. The tests are using files rather then directories so it didn't catch. We 
need to fix the test also.

From my understanding, this issue happens when the destination directory is not 
empty and we try to move with REPLACE_EXISTING, but I cannot see how that 
happened here, because if the destination directory were not empty, we should 
have returned earlier and never gone on to check the old NM local dirs.

> YARN shuffle service fails to relocate recovery DB directories
> --
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The yarn shuffle service tries to switch from the yarn local directories to 
> the real recovery directory but can fail to move the existing recovery db's.  
> It fails due to Files.move not doing directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo and we switched it in the pr due to review 
> comments and it looks like didn't do a final real test. The tests are using 
> files rather then directories so it didn't catch. We need to fix the test 
> also.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440






[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle

2017-03-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900825#comment-15900825
 ] 

Nick Pentreath commented on SPARK-13969:


I think {{HashingTF}} and {{FeatureHasher}} are different things - similar to 
HashingVectorizer and FeatureHasher in scikit-learn.

{{HashingTF}} (HashingVectorizer) transforms a Seq of terms (typically 
{{String}}) into a term frequency vector. Technically it can operate on a Seq 
of any type (well, actually only strings and numbers, see the [murmur3Hash 
function|https://github.com/apache/spark/blob/60022bfd65e4637efc0eb5f4cc0112289c783147/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala#L151]).
It could certainly operate on multiple columns - that would hash all of the 
sentence columns into one term frequency vector - but that seems less likely 
to be used in practice (though Vowpal Wabbit supports a form of this with its 
namespaces).

What {{HashingTF}} does not support is arbitrary categorical or numeric 
columns. It is possible to support categorical "one-hot" style encoding using 
what I have come to call the "stringify hack" - transforming a set of 
categorical columns into a Seq for input to HashingTF.

So taking, say, two categorical columns {{city}} and {{state}}, for example:
{code}
+--------+-----+-------------------------+
|city    |state|stringified              |
+--------+-----+-------------------------+
|Boston  |MA   |[city=Boston, state=MA]  |
|New York|NY   |[city=New York, state=NY]|
+--------+-----+-------------------------+
{code}

This works but is pretty ugly, doesn't fit nicely into a pipeline, and can't 
support numeric columns.
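
As a concrete illustration of the hack, here is a minimal sketch (the column 
names and the {{spark}} session are assumptions for the example; this is not 
the proposed {{FeatureHasher}}):

{code}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.functions.{array, col, concat, lit}

val df = spark.createDataFrame(Seq(
  ("Boston", "MA"),
  ("New York", "NY")
)).toDF("city", "state")

// Turn each categorical column into a "name=value" token and collect the
// tokens into a single array column that HashingTF can consume.
val stringified = df.withColumn("stringified",
  array(concat(lit("city="), col("city")), concat(lit("state="), col("state"))))

val hashed = new HashingTF()
  .setInputCol("stringified")
  .setOutputCol("features")
  .transform(stringified)
{code}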

The {{FeatureHasher}} I propose acts like the one in scikit-learn - it can 
handle multiple numeric and/or categorical columns in one pass. I go into some 
detail
about all of this in my [Spark Summit East 2017 
talk|https://www.slideshare.net/SparkSummit/feature-hashing-for-scalable-machine-learning-spark-summit-east-talk-by-nick-pentreath].
 The rough draft of it used for the talk is 
[here|https://github.com/MLnick/spark/blob/FeatureHasher/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala].

Another nice thing about the {{FeatureHasher}} is that it opens up 
possibilities for things like Vowpal Wabbit's namespaces, and it would be 
interesting to see if we could mimic their internal feature crossing, and so on.

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).






[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900824#comment-15900824
 ] 

Takeshi Yamamuro commented on SPARK-15463:
--

Have you seen https://github.com/apache/spark/pull/13300#issuecomment-261156734 
for related discussion?
Currently, I think [~hyukjin.kwon]'s idea is preferable: 
https://github.com/apache/spark/pull/16854#issue-206224691.
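
For context, a rough sketch of the two shapes under discussion (assumes a 
{{spark}} session; the CSV overload is the API proposed in those PRs, not 
something available at the time of writing):

{code}
// Existing: JSON records already loaded as strings can be parsed directly.
val jsonLines = Seq("""{"a": 1}""", """{"a": 2}""")
val jsonDF = spark.read.json(spark.sparkContext.parallelize(jsonLines))

// Proposed (per the linked PRs): an equivalent entry point for CSV that takes
// pre-processed lines, e.g. something like
//   val csvDF = spark.read.option("header", "true").csv(csvDataset)
// where csvDataset is a Dataset[String].
{code}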

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>
> I currently use Databrick's spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].






[jira] [Commented] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

2017-03-07 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900821#comment-15900821
 ] 

Saisai Shao commented on SPARK-19812:
-

[~tgraves], I'm not quite sure what you mean here?

bq. The tests are using files rather then directories so it didn't catch. We 
need to fix the test also.

From my understanding, this issue happens when the destination directory is not 
empty and we try to move with REPLACE_EXISTING, but I cannot see how that 
happened here, because if the destination directory were not empty, we should 
have returned earlier and never gone on to check the old NM local dirs.
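
For reference, a minimal sketch (an editorial illustration, not the Spark or 
Hadoop code; the helper name is hypothetical) of why {{Files.move}} throws here 
and what a recursive fallback could look like:

{code}
import java.nio.file.{DirectoryNotEmptyException, Files, Path, StandardCopyOption}

// Files.move throws DirectoryNotEmptyException when REPLACE_EXISTING is set
// and the target is a non-empty directory, or when the source is a non-empty
// directory that cannot simply be renamed (e.g. the move crosses file
// systems). A fallback is to move the entries one by one, then delete the
// now-empty source directory.
def moveRecursively(src: Path, dst: Path): Unit = {
  try {
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING)
  } catch {
    case _: DirectoryNotEmptyException =>
      Files.createDirectories(dst)
      val children = Files.list(src).iterator()
      while (children.hasNext) {
        val child = children.next()
        moveRecursively(child, dst.resolve(child.getFileName))
      }
      Files.delete(src) // src is empty once its entries have been moved
  }
}
{code}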

> YARN shuffle service fails to relocate recovery DB directories
> --
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The yarn shuffle service tries to switch from the yarn local directories to 
> the real recovery directory but can fail to move the existing recovery db's.  
> It fails due to Files.move not doing directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo and we switched it in the pr due to review 
> comments and it looks like didn't do a final real test. The tests are using 
> files rather then directories so it didn't catch. We need to fix the test 
> also.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440






[jira] [Commented] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900773#comment-15900773
 ] 

Apache Spark commented on SPARK-19843:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17205

> UTF8String => (int / long) conversion expensive for invalid inputs
> --
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.2.0
>
>
> In case of invalid inputs, converting a UTF8String to int or long returns 
> null. This comes at a cost wherein the method for conversion (e.g [0]) would 
> throw an exception. Exception handling is expensive as it will convert the 
> UTF8String into a java string, populate the stack trace (which is a native 
> call). While migrating workloads from Hive -> Spark, I see that this at an 
> aggregate level affects the performance of queries in comparison with hive.
> The exception is just indicating that the conversion failed.. its not 
> propagated to users so it would be good to avoid.
> Couple of options:
> - Return Integer / Long (instead of primitive types) which can be set to NULL 
> if the conversion fails. This is boxing and super bad for perf so a big no.
> - Hive has a pre-check [1] for this which is not a perfect safety net but 
> helpful to capture typical bad inputs eg. empty string, "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
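
To illustrate the pre-check option mentioned above, here is a hypothetical 
helper sketched for this discussion (not Spark's {{UTF8String}} code): it 
rejects the common bad inputs cheaply so that no exception has to be thrown or 
handled for them.

{code}
// Hypothetical cheap pre-check, loosely modeled on the Hive LazyUtils check
// linked above. It filters out inputs that can never parse as a Long; the
// real parser still has to handle overflow and other edge cases.
def looksLikeLong(s: String): Boolean = {
  if (s == null || s.isEmpty || s.equalsIgnoreCase("null")) return false
  var i = if (s.charAt(0) == '+' || s.charAt(0) == '-') 1 else 0
  if (i >= s.length || s.length - i > 19) return false // a Long has at most 19 digits
  while (i < s.length) {
    if (!Character.isDigit(s.charAt(i))) return false
    i += 1
  }
  true
}
{code}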






[jira] [Comment Edited] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-03-07 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900741#comment-15900741
 ] 

jin xing edited comment on SPARK-19659 at 3/8/17 5:47 AM:
--

[~irashid] [~rxin]
I uploaded SPARK-19659-design-v2.pdf, please take a look.

Yes, only outliers should be tracked and MapStatus should stay compact. It is a 
good idea to track all the sizes that are more than 2x the average; "2x" would 
be the default value, configurable by a parameter.

We need to collect metrics for stability and performance:

- For stability:
 1) Blocks whose size is between the average and 2x the average are 
underestimated, so there is a risk that those blocks cannot fit in memory. In 
this approach, the driver should calculate the sum of the underestimated sizes.
 2) Show the distribution of block sizes in MapStatus, bucketed into (0~100K, 
100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which will help a lot with 
debugging (e.g. finding skew situations).
- For performance:
 1) Peak memory (off heap or on heap) used for fetching blocks should be 
tracked.
 2) Fetching blocks to disk will cause performance degradation. The executor 
should calculate the size of blocks shuffled to disk and the time cost.

Yes, Imran, I will definitely break the proposal into smaller pieces (JIRAs and 
PRs). The metrics part should be done first, before the other parts proposed 
here.

What's more, why not make 2000 a configurable parameter, so that users can 
choose how accurately block sizes are tracked.




was (Author: jinxing6...@126.com):
[~irashid] [~rxin]
I uploaded SPARK-19659-design-v2.pdf, please take a look.

Yes, only outliers should be tracked and MapStatus should stay compact. It is a 
good idea to track all the sizes that are more than 2x the average; "2x" would 
be the default value, configurable by a parameter.

We need to collect metrics for stability and performance:

- For stability:
 1) Blocks whose size is between the average and 2x the average are 
underestimated, so there is a risk that those blocks cannot fit in memory. In 
this approach, the driver should calculate the sum of the underestimated sizes.
 2) Show the distribution of block sizes in MapStatus, bucketed into (0~100K, 
100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which will help a lot with 
debugging (e.g. finding skew situations).
- For performance:
 1) Memory (off heap or on heap) used for fetching blocks should be tracked.
 2) Fetching blocks to disk will cause performance degradation. The executor 
should calculate the size of blocks shuffled to disk and the time cost.

Yes, Imran, I will definitely break the proposal into smaller pieces (JIRAs and 
PRs). The metrics part should be done first, before the other parts proposed 
here.

What's more, why not make 2000 a configurable parameter, so that users can 
choose how accurately block sizes are tracked.



> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]






[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-03-07 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900741#comment-15900741
 ] 

jin xing commented on SPARK-19659:
--

[~irashid] [~rxin]
I uploaded SPARK-19659-design-v2.pdf, please take a look.

Yes, only outliers should be tracked and MapStatus should stay compact. It is a 
good idea to track all the sizes that are more than 2x the average; "2x" would 
be the default value, configurable by a parameter.

We need to collect metrics for stability and performance:

- For stability:
 1) Blocks whose size is between the average and 2x the average are 
underestimated, so there is a risk that those blocks cannot fit in memory. In 
this approach, the driver should calculate the sum of the underestimated sizes.
 2) Show the distribution of block sizes in MapStatus, bucketed into (0~100K, 
100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which will help a lot with 
debugging (e.g. finding skew situations).
- For performance:
 1) Memory (off heap or on heap) used for fetching blocks should be tracked.
 2) Fetching blocks to disk will cause performance degradation. The executor 
should calculate the size of blocks shuffled to disk and the time cost.

Yes, Imran, I will definitely break the proposal into smaller pieces (JIRAs and 
PRs). The metrics part should be done first, before the other parts proposed 
here.

What's more, why not make 2000 a configurable parameter, so that users can 
choose how accurately block sizes are tracked.
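
A rough sketch of the outlier idea (illustrative only, not the actual 
{{MapStatus}} code; the multiplier is the configurable "2x" mentioned above):

{code}
// Keep the average for typical blocks and exact sizes only for outliers
// larger than `multiplier` times the average; everything not in the map
// is reported as the average size.
def summarizeBlockSizes(sizes: Array[Long], multiplier: Double = 2.0)
    : (Long, Map[Int, Long]) = {
  val nonEmpty = sizes.filter(_ > 0)
  val avg = if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
  val outliers = sizes.zipWithIndex.collect {
    case (size, reduceId) if avg > 0 && size > avg * multiplier => reduceId -> size
  }.toMap
  (avg, outliers)
}
{code}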



> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]






[jira] [Updated] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-03-07 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-19659:
-
Attachment: SPARK-19659-design-v2.pdf

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]






[jira] [Updated] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.

2017-03-07 Thread wuchang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuchang updated SPARK-19860:

Description: 
{code}
>>> print df1.collect()
[Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', 
in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), 
Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', 
in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), 
Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', 
in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), 
Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', 
in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), 
Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', 
in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), 
Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', 
in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), 
Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', 
in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), 
Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', 
in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), 
Row(fdate=u'20170301', in_amount1=10159653)]
>>> print df2.collect()
[Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', 
in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), 
Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', 
in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), 
Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', 
in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), 
Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', 
in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), 
Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', 
in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), 
Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', 
in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), 
Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', 
in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), 
Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', 
in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), 
Row(fdate=u'20170301', in_amount2=9475418)]
>>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially 
true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
jdf = self._jdf.join(other._jdf, on._jc, how)
  File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 
933, in __call__
  File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"
Failure when resolving conflicting references in Join:
'Join Inner, (fdate#42 = fdate#42)
:- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as 
int) AS in_amount1#97]
:  +- Filter (inorout#44 = A)
: +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
:+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT 
(firm_id#40 = -1) && (fdate#42 >= 20170201)))
:   +- SubqueryAlias history_transfer_v
:  +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
bankwaterid#48, waterid#49, waterstate#50, source#51]
: +- SubqueryAlias history_transfer
:+- 
Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
 parquet
+- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as 
int) AS in_amount2#145]
   +- Filter (inorout#44 = B)
  +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
 +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT 
(firm_id#40 = -1) && (fdate#42 >= 20170201)))
 +- SubqueryAlias history_transfer_v
 +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
bankwaterid#48, waterid#49, waterstate#50, source#51]
 +- SubqueryAlias history_transfer
 +- 
Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49

[jira] [Resolved] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-03-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19348.
---
   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1

Issue resolved by pull request 17195
[https://github.com/apache/spark/pull/17195]

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> ---
>
> Key: SPARK-19348
> URL: https://issues.apache.org/jira/browse/SPARK-19348
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Vinayak Joshi
>Assignee: Bryan Cutler
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
> Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate 
> python threads, it is observed that the stages used to construct a pipeline 
> object get corrupted i.e the stages supplied to a Pipeline object in one 
> thread appear inside a different Pipeline object constructed in a different 
> thread. 
> Things work fine if construction of pyspark.ml.Pipeline objects is 
> serialized, so this looks like a thread safety problem with 
> pyspark.ml.Pipeline object construction. 
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know 
> if performing other pipeline operations, such as pyspark.ml.pipeline.fit( ) 
> are also affected by the underlying cause of this problem. That is, whether 
> other pipeline operations like pyspark.ml.pipeline.fit( )  may be performed 
> in separate threads (on distinct pipeline objects) concurrently without any 
> cross contamination between them.






[jira] [Updated] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API

2017-03-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19866:
--
Shepherd: Joseph K. Bradley

> Add local version of Word2Vec findSynonyms for spark.ml: Python API
> ---
>
> Key: SPARK-19866
> URL: https://issues.apache.org/jira/browse/SPARK-19866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add Python API for findSynonymsArray matching Scala API in linked JIRA.






[jira] [Created] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API

2017-03-07 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19866:
-

 Summary: Add local version of Word2Vec findSynonyms for spark.ml: 
Python API
 Key: SPARK-19866
 URL: https://issues.apache.org/jira/browse/SPARK-19866
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Priority: Minor


Add Python API for findSynonymsArray matching Scala API in linked JIRA.






[jira] [Resolved] (SPARK-17629) Add local version of Word2Vec findSynonyms for spark.ml

2017-03-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17629.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16811
[https://github.com/apache/spark/pull/16811]

> Add local version of Word2Vec findSynonyms for spark.ml
> ---
>
> Key: SPARK-17629
> URL: https://issues.apache.org/jira/browse/SPARK-17629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Asher Krim
>Assignee: Asher Krim
>Priority: Minor
> Fix For: 2.2.0
>
>
> ml Word2Vec's findSynonyms methods depart from mllib in that they return 
> distributed results, rather than the results directly:
> {code}
>   def findSynonyms(word: String, num: Int): DataFrame = {
> val spark = SparkSession.builder().getOrCreate()
> spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
> "similarity")
>   }
> {code}
> What was the reason for this decision? I would think that most users would 
> request a reasonably small number of results back, and want to use them 
> directly on the driver, similar to the _take_ method on dataframes. Returning 
> parallelized results creates a costly round trip for the data that doesn't 
> seem necessary.
> The original PR: https://github.com/apache/spark/pull/7263
> [~MechCoder] - do you perhaps recall the reason?
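
For reference, the local variant added by the linked PR is 
{{findSynonymsArray}}, which returns the results directly on the driver instead 
of as a DataFrame. A small usage sketch (the toy corpus and the {{spark}} 
session are assumptions for the example):

{code}
import org.apache.spark.ml.feature.Word2Vec

// Tiny toy corpus just to obtain a fitted model.
val docs = spark.createDataFrame(Seq(
  Tuple1("spark is a fast engine".split(" ").toSeq),
  Tuple1("spark runs on yarn".split(" ").toSeq)
)).toDF("text")

val model = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("vectors")
  .setVectorSize(3)
  .setMinCount(0)
  .fit(docs)

// Local (word, similarity) pairs, no round trip through a distributed DataFrame.
val synonyms = model.findSynonymsArray("spark", 2)
{code}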






[jira] [Resolved] (SPARK-19859) The new watermark should override the old one

2017-03-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19859.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> The new watermark should override the old one
> -
>
> Key: SPARK-19859
> URL: https://issues.apache.org/jira/browse/SPARK-19859
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> The new watermark should override the old one. Otherwise, we just pick up the 
> first column which has a watermark, it may be unexpected.






[jira] [Resolved] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys

2017-03-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19841.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
> 
>
> Key: SPARK-19841
> URL: https://issues.apache.org/jira/browse/SPARK-19841
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Right now it just uses the rows to filter but a column position in 
> keyExpressions may be different than the position in row.






[jira] [Created] (SPARK-19865) remove the view identifier in SubqueryAlias

2017-03-07 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19865:
---

 Summary: remove the view identifier in SubqueryAlias
 Key: SPARK-19865
 URL: https://issues.apache.org/jira/browse/SPARK-19865
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Jiang Xingbo


Since we have a `View` node now, we can remove the view identifier in 
`SubqueryAlias`, which was previously used to indicate a view node.






[jira] [Assigned] (SPARK-18389) Disallow cyclic view reference

2017-03-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18389:
---

Assignee: Jiang Xingbo

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}






[jira] [Resolved] (SPARK-18389) Disallow cyclic view reference

2017-03-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18389.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17152
[https://github.com/apache/spark/pull/17152]

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.2.0
>
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}






[jira] [Assigned] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19864:


Assignee: (was: Apache Spark)

> add makeQualifiedPath in CatalogUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> Currently there are lots of places to make the path qualified, it is better 
> to provide a function to do this, then the code will be more simple.






[jira] [Commented] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900675#comment-15900675
 ] 

Apache Spark commented on SPARK-19864:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17204

> add makeQualifiedPath in CatalogUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> Currently there are lots of places to make the path qualified, it is better 
> to provide a function to do this, then the code will be more simple.






[jira] [Assigned] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19864:


Assignee: Apache Spark

> add makeQualifiedPath in CatalogUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Apache Spark
>Priority: Minor
>
> Currently there are lots of places to make the path qualified, it is better 
> to provide a function to do this, then the code will be more simple.






[jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19843:
---

Assignee: Tejas Patil

> UTF8String => (int / long) conversion expensive for invalid inputs
> --
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.2.0
>
>
> In case of invalid inputs, converting a UTF8String to int or long returns 
> null. This comes at a cost wherein the method for conversion (e.g [0]) would 
> throw an exception. Exception handling is expensive as it will convert the 
> UTF8String into a java string, populate the stack trace (which is a native 
> call). While migrating workloads from Hive -> Spark, I see that this at an 
> aggregate level affects the performance of queries in comparison with hive.
> The exception is just indicating that the conversion failed.. its not 
> propagated to users so it would be good to avoid.
> Couple of options:
> - Return Integer / Long (instead of primitive types) which can be set to NULL 
> if the conversion fails. This is boxing and super bad for perf so a big no.
> - Hive has a pre-check [1] for this which is not a perfect safety net but 
> helpful to capture typical bad inputs eg. empty string, "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90






[jira] [Resolved] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19843.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17184
[https://github.com/apache/spark/pull/17184]

> UTF8String => (int / long) conversion expensive for invalid inputs
> --
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
> Fix For: 2.2.0
>
>
> In case of invalid inputs, converting a UTF8String to int or long returns 
> null. This comes at a cost wherein the method for conversion (e.g [0]) would 
> throw an exception. Exception handling is expensive as it will convert the 
> UTF8String into a java string, populate the stack trace (which is a native 
> call). While migrating workloads from Hive -> Spark, I see that this at an 
> aggregate level affects the performance of queries in comparison with hive.
> The exception is just indicating that the conversion failed.. its not 
> propagated to users so it would be good to avoid.
> Couple of options:
> - Return Integer / Long (instead of primitive types) which can be set to NULL 
> if the conversion fails. This is boxing and super bad for perf so a big no.
> - Hive has a pre-check [1] for this which is not a perfect safety net but 
> helpful to capture typical bad inputs eg. empty string, "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90






[jira] [Created] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code

2017-03-07 Thread Song Jun (JIRA)
Song Jun created SPARK-19864:


 Summary: add makeQualifiedPath in CatalogUtils to optimize some 
code
 Key: SPARK-19864
 URL: https://issues.apache.org/jira/browse/SPARK-19864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently there are many places where we make a path qualified; it would be 
better to provide a single helper function for this, so the code becomes 
simpler.






[jira] [Assigned] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19863:


Assignee: Apache Spark

> Whether or not use CachedKafkaConsumer need to be configured, when you use 
> DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
> 
>
> Key: SPARK-19863
> URL: https://issues.apache.org/jira/browse/SPARK-19863
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>Assignee: Apache Spark
>
> Whether or not use CachedKafkaConsumer need to be configured, when you use 
> DirectKafkaInputDStream to connect the kafka in a Spark Streaming 
> application. In Spark 2.x, the kafka consumer was replaced by 
> CachedKafkaConsumer (some KafkaConsumer will keep establishing the kafka 
> cluster), and cannot change the way. In fact ,The KafkaRDD(used by 
> DirectKafkaInputDStream to connect kafka) provide the parameter 
> useConsumerCache to choose Whether to use the CachedKafkaConsumer, but the 
> DirectKafkaInputDStream set the parameter true.






[jira] [Assigned] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19863:


Assignee: (was: Apache Spark)

> Whether or not use CachedKafkaConsumer need to be configured, when you use 
> DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
> 
>
> Key: SPARK-19863
> URL: https://issues.apache.org/jira/browse/SPARK-19863
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>
> Whether or not use CachedKafkaConsumer need to be configured, when you use 
> DirectKafkaInputDStream to connect the kafka in a Spark Streaming 
> application. In Spark 2.x, the kafka consumer was replaced by 
> CachedKafkaConsumer (some KafkaConsumer will keep establishing the kafka 
> cluster), and cannot change the way. In fact ,The KafkaRDD(used by 
> DirectKafkaInputDStream to connect kafka) provide the parameter 
> useConsumerCache to choose Whether to use the CachedKafkaConsumer, but the 
> DirectKafkaInputDStream set the parameter true.






[jira] [Commented] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900669#comment-15900669
 ] 

Apache Spark commented on SPARK-19863:
--

User 'lvdongr' has created a pull request for this issue:
https://github.com/apache/spark/pull/17203

> Whether or not use CachedKafkaConsumer need to be configured, when you use 
> DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
> 
>
> Key: SPARK-19863
> URL: https://issues.apache.org/jira/browse/SPARK-19863
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>
> Whether to use CachedKafkaConsumer should be configurable when you use 
> DirectKafkaInputDStream to connect to Kafka in a Spark Streaming 
> application. In Spark 2.x the Kafka consumer was replaced by 
> CachedKafkaConsumer (a cached KafkaConsumer keeps its connection to the 
> Kafka cluster open), and there is currently no way to change this. In fact, 
> KafkaRDD (used by DirectKafkaInputDStream to connect to Kafka) provides the 
> parameter useConsumerCache to choose whether to use the CachedKafkaConsumer, 
> but DirectKafkaInputDStream always sets that parameter to true.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19863) Whether to use CachedKafkaConsumer should be configurable when using DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application

2017-03-07 Thread LvDongrong (JIRA)
LvDongrong created SPARK-19863:
--

 Summary: Whether to use CachedKafkaConsumer should be configurable 
when using DirectKafkaInputDStream to connect to Kafka in a Spark Streaming 
application
 Key: SPARK-19863
 URL: https://issues.apache.org/jira/browse/SPARK-19863
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Input/Output
Affects Versions: 2.1.0
Reporter: LvDongrong


Whether to use CachedKafkaConsumer should be configurable when you use 
DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application. 
In Spark 2.x the Kafka consumer was replaced by CachedKafkaConsumer (a cached 
KafkaConsumer keeps its connection to the Kafka cluster open), and there is 
currently no way to change this. In fact, KafkaRDD (used by 
DirectKafkaInputDStream to connect to Kafka) provides the parameter 
useConsumerCache to choose whether to use the CachedKafkaConsumer, but 
DirectKafkaInputDStream always sets that parameter to true.
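
A minimal sketch of the kind of knob being asked for, assuming a hypothetical 
configuration key (the key name and the wiring are illustrative, not an existing 
Spark option): DirectKafkaInputDStream would read the flag and pass it through to 
KafkaRDD instead of hard-coding useConsumerCache = true.

{code}
import org.apache.spark.SparkConf

// Illustrative only: read a flag from SparkConf and hand it to KafkaRDD instead of
// hard-coding useConsumerCache = true inside DirectKafkaInputDStream.
object ConsumerCacheFlag {
  // Hypothetical key; not an existing Spark configuration.
  val Key = "spark.streaming.kafka.consumer.cache.enabled"

  def useConsumerCache(conf: SparkConf): Boolean = conf.getBoolean(Key, true)
}

// Inside DirectKafkaInputDStream.compute(), the RDD construction would then look
// roughly like:
//   new KafkaRDD[K, V](sc, executorKafkaParams, offsetRanges, preferredHosts,
//     useConsumerCache = ConsumerCacheFlag.useConsumerCache(sc.getConf))
{code}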



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle

2017-03-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900662#comment-15900662
 ] 

Joseph K. Bradley commented on SPARK-13969:
---

Noticing this JIRA again.  I feel like this is partly solved:
* HashingTF can handle pretty much arbitrary types.
* HashingTF should have identical behavior across languages and JVM versions.  
(These first 2 are because HashingTF converts the input to StringType and then 
hashes it using MurmurHash3.)

with 1 remaining item:
* HashingTF does not take multiple input columns.  That would effectively make 
it handle VectorAssembler's job as well.

What are your thoughts here?  Should we just stick with VectorAssembler?
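
For reference, a small sketch of how the two existing stages compose today: 
HashingTF hashes a single array-of-strings column, and VectorAssembler then merges 
the hashed vector with other columns. This is only the status quo written out; the 
column names and data are made up.

{code}
import org.apache.spark.ml.feature.{HashingTF, VectorAssembler}
import org.apache.spark.sql.SparkSession

object HashingStatusQuo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hashing-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Seq("spark", "ml", "spark"), 3.0),
      (Seq("feature", "hashing"), 1.0)
    ).toDF("words", "score")

    // HashingTF: one input column of terms -> hashed term-frequency vector.
    val tf = new HashingTF()
      .setInputCol("words")
      .setOutputCol("wordFeatures")
      .setNumFeatures(1 << 10)

    // VectorAssembler: merge the hashed vector with the remaining numeric column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("wordFeatures", "score"))
      .setOutputCol("features")

    assembler.transform(tf.transform(df)).select("features").show(false)
    spark.stop()
  }
}
{code}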

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19862) In SparkEnv.scala, the shortShuffleMgrNames entry "tungsten-sort" can be deleted.

2017-03-07 Thread guoxiaolong (JIRA)
guoxiaolong created SPARK-19862:
---

 Summary: In SparkEnv.scala, the shortShuffleMgrNames entry 
"tungsten-sort" can be deleted. 
 Key: SPARK-19862
 URL: https://issues.apache.org/jira/browse/SPARK-19862
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: guoxiaolong
Priority: Minor


"tungsten-sort" -> 
classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
deleted. Because it is the same of "sort" -> 
classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.
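
For context, a simplified sketch of the mapping in question: both short names 
resolve to the same SortShuffleManager class, so the "tungsten-sort" entry is 
redundant apart from (presumably) keeping old spark.shuffle.manager=tungsten-sort 
settings working.

{code}
// Simplified from SparkEnv.scala: both aliases point at the same implementation.
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
{code}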



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19861) watermark should not be a negative time.

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900631#comment-15900631
 ] 

Apache Spark commented on SPARK-19861:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17202

> watermark should not be a negative time.
> 
>
> Key: SPARK-19861
> URL: https://issues.apache.org/jira/browse/SPARK-19861
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19861:


Assignee: (was: Apache Spark)

> watermark should not be a negative time.
> 
>
> Key: SPARK-19861
> URL: https://issues.apache.org/jira/browse/SPARK-19861
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19861:


Assignee: Apache Spark

> watermark should not be a negative time.
> 
>
> Key: SPARK-19861
> URL: https://issues.apache.org/jira/browse/SPARK-19861
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19861) watermark should not be a negative time.

2017-03-07 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19861:
-

 Summary: watermark should not be a negative time.
 Key: SPARK-19861
 URL: https://issues.apache.org/jira/browse/SPARK-19861
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing

2017-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900608#comment-15900608
 ] 

Takeshi Yamamuro commented on SPARK-18359:
--

Since JDK 9 uses CLDR as the default locale provider, it seems good to handle 
the locale explicitly: http://openjdk.java.net/jeps/252
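
To make the request concrete, a minimal sketch of the locale-aware decimal parsing 
such an option would delegate to on the JVM; a CSV option like .option("locale", 
"fr-FR") does not exist yet and is only hypothetical here.

{code}
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object LocaleDecimalSketch {
  def main(args: Array[String]): Unit = {
    // French conventions: comma as the decimal separator.
    val french = new DecimalFormat("#,##0.##", DecimalFormatSymbols.getInstance(Locale.FRANCE))
    val value = french.parse("1234,56").doubleValue()
    println(value)  // 1234.56 -- what a locale-aware CSV reader would produce
  }
}
{code}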

> Let user specify locale in CSV parsing
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: yannick Radji
>
> On the DataFrameReader object there is no CSV-specific option to set the 
> decimal separator to a comma instead of a dot, as is customary in France and 
> elsewhere in Europe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19860) DataFrame join gets a conflict error if the two frames have a column with the same name.

2017-03-07 Thread wuchang (JIRA)
wuchang created SPARK-19860:
---

 Summary: DataFrame join gets a conflict error if the two frames have 
a column with the same name.
 Key: SPARK-19860
 URL: https://issues.apache.org/jira/browse/SPARK-19860
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: wuchang


>>> print df1.collect()
[Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', 
in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), 
Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', 
in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), 
Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', 
in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), 
Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', 
in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), 
Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', 
in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), 
Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', 
in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), 
Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', 
in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), 
Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', 
in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), 
Row(fdate=u'20170301', in_amount1=10159653)]
>>> print df2.collect()
[Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', 
in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), 
Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', 
in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), 
Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', 
in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), 
Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', 
in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), 
Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', 
in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), 
Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', 
in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), 
Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', 
in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), 
Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', 
in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), 
Row(fdate=u'20170301', in_amount2=9475418)]
>>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially 
true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
jdf = self._jdf.join(other._jdf, on._jc, how)
  File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 
933, in __call__
  File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"
Failure when resolving conflicting references in Join:
'Join Inner, (fdate#42 = fdate#42)
:- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as 
int) AS in_amount1#97]
:  +- Filter (inorout#44 = A)
: +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
:+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT 
(firm_id#40 = -1) && (fdate#42 >= 20170201)))
:   +- SubqueryAlias history_transfer_v
:  +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
bankwaterid#48, waterid#49, waterstate#50, source#51]
: +- SubqueryAlias history_transfer
:+- 
Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
 parquet
+- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as 
int) AS in_amount2#145]
   +- Filter (inorout#44 = B)
  +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
 +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT 
(firm_id#40 = -1) && (fdate#42 >= 20170201)))
 +- SubqueryAlias history_transfer_v
 +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
bankwaterid#48, waterid#49, waterstate#50, sour
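
The usual workaround for this kind of conflicting-reference error is to join on the 
shared column name, or to alias the two frames so the duplicate fdate references 
stay distinguishable. A minimal Scala sketch with made-up data; the same 
alias()/join(on=['fdate']) pattern applies in PySpark.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SameNameJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("same-name-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("20170223", 7758588), ("20170302", 7656414)).toDF("fdate", "in_amount1")
    val df2 = Seq(("20170223", 7072120), ("20170302", 5548515)).toDF("fdate", "in_amount2")

    // Option 1: join on the column name; the output keeps a single fdate column.
    df1.join(df2, Seq("fdate"), "inner").show()

    // Option 2: alias both sides so the two fdate attributes stay distinguishable.
    df1.alias("a").join(df2.alias("b"), col("a.fdate") === col("b.fdate"), "inner").show()

    spark.stop()
  }
}
{code}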

[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900568#comment-15900568
 ] 

Apache Spark commented on SPARK-18055:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/17201

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18055:


Assignee: Michael Armbrust  (was: Apache Spark)

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18055:


Assignee: Apache Spark  (was: Michael Armbrust)

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Apache Spark
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18055:
-
Target Version/s: 2.2.0

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19810) Remove support for Scala 2.10

2017-03-07 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900548#comment-15900548
 ] 

Min Shen commented on SPARK-19810:
--

[~srowen],

Want to get an idea regarding the timeline for removing Scala 2.10.
We have heavy usage of Spark at LinkedIn, and we are right now still deploying 
Spark built with Scala 2.10 due to various dependencies on other systems we 
have which still rely on Scala 2.10.
While we also have plans to upgrade our various internal systems to start using 
Scala 2.11, it will take a while for that to happen.
In the meantime, if support for Scala 2.10 is removed in Spark 2.2, this could 
block us from upgrading to Spark 2.2+ until we have fully moved off Scala 2.10.
Want to raise this concern here and also to understand the timeline for 
removing Scala 2.10 in Spark.

> Remove support for Scala 2.10
> -
>
> Key: SPARK-19810
> URL: https://issues.apache.org/jira/browse/SPARK-19810
> Project: Spark
>  Issue Type: Task
>  Components: ML, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Critical
>
> This tracks the removal of Scala 2.10 support, as discussed in 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html
>  and other lists.
> The primary motivations are to simplify the code and build, and to enable 
> Scala 2.12 support later. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-18055:


Assignee: Michael Armbrust

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2017-03-07 Thread Jim Kleckner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900533#comment-15900533
 ] 

Jim Kleckner commented on SPARK-16333:
--

I ended up here when looking into why an upgrade of our streaming computation 
to 2.1.0 was pegging the network at a gigabit/second.
Setting spark.eventLog.enabled to false confirmed that this logging from 
slave port 50010 was the culprit.

How can anyone with seriously large numbers of tasks use spark history with 
this amount of load?
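
For anyone hitting the same thing, a short sketch of the relevant standard settings: 
event logging can be turned off outright, or kept with compression to reduce the 
volume (whether compression alone is enough at this scale is not claimed here).

{code}
import org.apache.spark.SparkConf

// Either disable event logging entirely (no history server entries), or keep it
// and compress the event log stream.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "false")
// alternatively:
//   .set("spark.eventLog.enabled", "true")
//   .set("spark.eventLog.compress", "true")
{code}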

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark 2.0.0-preview (May-24 build), the history event data (the JSON 
> file) that is generated for each Spark application run (see below) can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark 1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900524#comment-15900524
 ] 

Apache Spark commented on SPARK-19561:
--

User 'JasonMWhite' has created a pull request for this issue:
https://github.com/apache/spark/pull/17200

> Pyspark Dataframes don't allow timestamps near epoch
> 
>
> Key: SPARK-19561
> URL: https://issues.apache.org/jira/browse/SPARK-19561
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 2.1.1, 2.2.0
>
>
> Pyspark does not allow timestamps at or near the epoch to be created in a 
> DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299
> TimestampType.toInternal converts a datetime object to a number representing 
> microseconds since the epoch. For all times more than 2148 seconds before or 
> after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J 
> automatically serializes it as a long.
> However, for times within this range (~35 minutes before or after the epoch), 
> Py4J serializes it as an int. When creating the object on the Scala side, 
> ints are not recognized and the value goes to null. This leads to null values 
> in non-nullable fields, and corrupted Parquet files.
> The solution is trivial - force TimestampType.toInternal to always return a 
> long.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900463#comment-15900463
 ] 

Vincent commented on SPARK-19852:
-

I can work on this issue, since it is related to SPARK-17498

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide
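
For context, a usage sketch of the Scala-side param this doc update tracks; "error" 
(the default) and "skip" are the currently documented values, and the additional 
option from the linked work is what the Python docs still need to reflect. 
Illustrative usage only:

{code}
import org.apache.spark.ml.feature.StringIndexer

// handleInvalid controls what happens to labels unseen during fitting.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")   // or "error"; plus the new option once released
{code}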



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19857) CredentialUpdater calculates the wrong time for next update

2017-03-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19857.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.2.0
   2.1.1

> CredentialUpdater calculates the wrong time for next update
> ---
>
> Key: SPARK-19857
> URL: https://issues.apache.org/jira/browse/SPARK-19857
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
>
> This is the code:
> {code}
> val remainingTime = 
> getTimeOfNextUpdateFromFileName(credentialsStatus.getPath)
>   - System.currentTimeMillis()
> {code}
> If you spot the problem, you get a virtual cookie.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19859) The new watermark should override the old one

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900410#comment-15900410
 ] 

Apache Spark commented on SPARK-19859:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17199

> The new watermark should override the old one
> -
>
> Key: SPARK-19859
> URL: https://issues.apache.org/jira/browse/SPARK-19859
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The new watermark should override the old one. Otherwise, we just pick up the 
> first column that has a watermark, which may be unexpected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19859) The new watermark should override the old one

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19859:


Assignee: Shixiong Zhu  (was: Apache Spark)

> The new watermark should override the old one
> -
>
> Key: SPARK-19859
> URL: https://issues.apache.org/jira/browse/SPARK-19859
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The new watermark should override the old one. Otherwise, we just pick up the 
> first column that has a watermark, which may be unexpected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19859) The new watermark should override the old one

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19859:


Assignee: Apache Spark  (was: Shixiong Zhu)

> The new watermark should override the old one
> -
>
> Key: SPARK-19859
> URL: https://issues.apache.org/jira/browse/SPARK-19859
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> The new watermark should override the old one. Otherwise, we just pick up the 
> first column that has a watermark, which may be unexpected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19859) The new watermark should override the old one

2017-03-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19859:


 Summary: The new watermark should override the old one
 Key: SPARK-19859
 URL: https://issues.apache.org/jira/browse/SPARK-19859
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


The new watermark should override the old one. Otherwise, we just pick up the 
first column that has a watermark, which may be unexpected.
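
A minimal sketch of the scenario: two withWatermark calls on the same stream, where 
after this change the later (10 minute) watermark should win instead of the first 
one silently sticking. The rate source (available in newer Spark releases) is only a 
stand-in for any streaming source with an event-time column.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object WatermarkOverrideSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("watermark-override").master("local[*]").getOrCreate()

    // "rate" emits (timestamp, value) rows at a fixed pace.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    val counted = events
      .withWatermark("timestamp", "1 hour")      // earlier watermark
      .withWatermark("timestamp", "10 minutes")  // should override the one above
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

    val query = counted.writeStream.format("console").outputMode("update").start()
    query.awaitTermination(10000)  // run briefly; a real job would block indefinitely
    spark.stop()
  }
}
{code}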



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19858:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900377#comment-15900377
 ] 

Apache Spark commented on SPARK-19858:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17197

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19858:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19858:


 Summary: Add output mode to flatMapGroupsWithState and disallow 
invalid cases
 Key: SPARK-19858
 URL: https://issues.apache.org/jira/browse/SPARK-19858
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19857) CredentialUpdater calculates the wrong time for next update

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900363#comment-15900363
 ] 

Apache Spark commented on SPARK-19857:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17198

> CredentialUpdater calculates the wrong time for next update
> ---
>
> Key: SPARK-19857
> URL: https://issues.apache.org/jira/browse/SPARK-19857
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> This is the code:
> {code}
> val remainingTime = 
> getTimeOfNextUpdateFromFileName(credentialsStatus.getPath)
>   - System.currentTimeMillis()
> {code}
> If you spot the problem, you get a virtual cookie.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19857:


Assignee: (was: Apache Spark)

> CredentialUpdater calculates the wrong time for next update
> ---
>
> Key: SPARK-19857
> URL: https://issues.apache.org/jira/browse/SPARK-19857
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> This is the code:
> {code}
> val remainingTime = 
> getTimeOfNextUpdateFromFileName(credentialsStatus.getPath)
>   - System.currentTimeMillis()
> {code}
> If you spot the problem, you get a virtual cookie.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19857:


Assignee: Apache Spark

> CredentialUpdater calculates the wrong time for next update
> ---
>
> Key: SPARK-19857
> URL: https://issues.apache.org/jira/browse/SPARK-19857
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> This is the code:
> {code}
> val remainingTime = 
> getTimeOfNextUpdateFromFileName(credentialsStatus.getPath)
>   - System.currentTimeMillis()
> {code}
> If you spot the problem, you get a virtual cookie.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19857) CredentialUpdater calculates the wrong time for next update

2017-03-07 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19857:
--

 Summary: CredentialUpdater calculates the wrong time for next 
update
 Key: SPARK-19857
 URL: https://issues.apache.org/jira/browse/SPARK-19857
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin


This is the code:

{code}
val remainingTime = 
getTimeOfNextUpdateFromFileName(credentialsStatus.getPath)
  - System.currentTimeMillis()
{code}

If you spot the problem, you get a virtual cookie.
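
For readers who do not want to hunt for the cookie: because the subtraction sits at 
the start of the next line, Scala's semicolon inference ends the statement after the 
method call and treats "- System.currentTimeMillis()" as a separate, discarded 
unary-minus expression, so remainingTime ends up being an absolute timestamp rather 
than a delay (this reading is inferred from the snippet, not taken from the patch). 
A sketch of the intended computation, with a hypothetical stand-in for the file-name 
parser:

{code}
object RemainingTimeSketch {
  // Hypothetical stand-in: the real method parses an absolute timestamp (ms) out of
  // the credentials file name.
  private def getTimeOfNextUpdateFromFileName(path: String): Long =
    System.currentTimeMillis() + 60000L

  def main(args: Array[String]): Unit = {
    // Keeping the operator before the line break keeps this a single expression,
    // so remainingTime is a relative delay, which is what the scheduler expects.
    val remainingTime =
      getTimeOfNextUpdateFromFileName("credentials-1488964800000") -
        System.currentTimeMillis()
    println(s"next credential update in $remainingTime ms")
  }
}
{code}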




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19855:


Assignee: Reynold Xin  (was: Apache Spark)

> Create an internal FilePartitionStrategy interface
> --
>
> Key: SPARK-19855
> URL: https://issues.apache.org/jira/browse/SPARK-19855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19855) Create an internal FilePartitionStrategy interface

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900315#comment-15900315
 ] 

Apache Spark commented on SPARK-19855:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17196

> Create an internal FilePartitionStrategy interface
> --
>
> Key: SPARK-19855
> URL: https://issues.apache.org/jira/browse/SPARK-19855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface

2017-03-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19855:


Assignee: Apache Spark  (was: Reynold Xin)

> Create an internal FilePartitionStrategy interface
> --
>
> Key: SPARK-19855
> URL: https://issues.apache.org/jira/browse/SPARK-19855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests

2017-03-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-19856:

Summary: Turn partitioning related test cases in FileSourceStrategySuite 
from integration tests into unit tests  (was: Turn partitioning related test 
cases in FileSourceStrategySuite into unit tests)

> Turn partitioning related test cases in FileSourceStrategySuite from 
> integration tests into unit tests
> --
>
> Key: SPARK-19856
> URL: https://issues.apache.org/jira/browse/SPARK-19856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19855) Create an internal FilePartitionStrategy interface

2017-03-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-19855:
---

 Summary: Create an internal FilePartitionStrategy interface
 Key: SPARK-19855
 URL: https://issues.apache.org/jira/browse/SPARK-19855
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite into unit tests

2017-03-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-19856:
---

 Summary: Turn partitioning related test cases in 
FileSourceStrategySuite into unit tests
 Key: SPARK-19856
 URL: https://issues.apache.org/jira/browse/SPARK-19856
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19854) Refactor file partitioning strategy to make it easier to extend / unit test

2017-03-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-19854:
---

 Summary: Refactor file partitioning strategy to make it easier to 
extend / unit test
 Key: SPARK-19854
 URL: https://issues.apache.org/jira/browse/SPARK-19854
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


The file partitioning strategy is currently hard-coded in 
FileSourceScanExec. This is not ideal for two reasons:

1. It is difficult to unit test the default strategy. In order to test this, we 
need to do almost end-to-end tests by creating actual files on the file system.

2. It is difficult to experiment with different partitioning strategies without 
adding a lot of if branches.

The goal of this story is to create an internal interface for this so we can 
make this pluggable for both better testing and experimentation.
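
A hypothetical sketch of what such a pluggable interface could look like; every name 
below, and the greedy packing used as the example default, is illustrative (loosely 
mirroring the current hard-coded behaviour), not the interface this ticket will 
actually add.

{code}
// Hypothetical sketch only -- none of these names are real Spark classes.
// A file-partition strategy groups the files selected for a scan into read partitions.
final case class FileSlice(path: String, start: Long, length: Long)
final case class ScanPartition(index: Int, files: Seq[FileSlice])

trait FilePartitionStrategy {
  def plan(files: Seq[FileSlice], maxSplitBytes: Long): Seq[ScanPartition]
}

// Roughly the default behaviour hard-coded in FileSourceScanExec today (simplified):
// sort by size descending, then first-fit pack into bins of at most maxSplitBytes.
object GreedyBinPacking extends FilePartitionStrategy {
  override def plan(files: Seq[FileSlice], maxSplitBytes: Long): Seq[ScanPartition] = {
    import scala.collection.mutable.ArrayBuffer
    val partitions = ArrayBuffer.empty[ScanPartition]
    var current = ArrayBuffer.empty[FileSlice]
    var currentSize = 0L
    def closePartition(): Unit = if (current.nonEmpty) {
      partitions += ScanPartition(partitions.size, current.toList)
      current = ArrayBuffer.empty[FileSlice]
      currentSize = 0L
    }
    files.sortBy(-_.length).foreach { file =>
      if (currentSize + file.length > maxSplitBytes) closePartition()
      current += file
      currentSize += file.length
    }
    closePartition()
    partitions.toList
  }
}
{code}

Strategies defined this way can be unit tested against in-memory FileSlice lists, 
which is exactly the testing gap described above.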




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10

2017-03-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18138:

Labels: releasenotes  (was: )

> More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
> 
>
> Key: SPARK-18138
> URL: https://issues.apache.org/jira/browse/SPARK-18138
> Project: Spark
>  Issue Type: Task
>Reporter: Reynold Xin
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 2.1.0
>
>
> Plan:
> - Make it very explicit in Spark 2.1.0 that support for the aforementioned 
> environments is deprecated.
> - Remove that support in Spark 2.2.0.
> Also see mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-07 Thread Ari Gesher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Gesher updated SPARK-19764:
---

We're driving everything from Python.  It may be a bug that we're not getting 
the error to propagate up to the notebook - generally, we see exceptions.  When 
we ran the same job from the PySpark shell, we saw the stacktrace, so I'm 
inclined to point at something in the notebook setup that made it not propagate.

We're happy to investigate if you think it's useful.



> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors end up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets

2017-03-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19853:
-
Target Version/s: 2.2.0

> Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
> 
>
> Key: SPARK-19853
> URL: https://issues.apache.org/jira/browse/SPARK-19853
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Chris Bowden
>Priority: Trivial
>
> When using the KafkaSource with Structured Streaming, consumer assignments 
> are not what the user expects if startingOffsets is set to an explicit set of 
> topics/partitions in JSON where the topic(s) happen to have uppercase 
> characters. When StartingOffsets is constructed, the original string value 
> from options is transformed toLowerCase to make matching on "earliest" and 
> "latest" case insensitive. However, the toLowerCase json is passed to 
> SpecificOffsets for the terminal condition, so topic names may not be what 
> the user intended by the time assignments are made with the underlying 
> KafkaConsumer.
> From KafkaSourceProvider:
> {code}
> val startingOffsets = 
> caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase)
>  match {
> case Some("latest") => LatestOffsets
> case Some("earliest") => EarliestOffsets
> case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
> case None => LatestOffsets
>   }
> {code}
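
A sketch of one possible fix, reusing the identifiers from the excerpt above: keep 
the "earliest"/"latest" matching case-insensitive but hand the original, 
non-lowercased JSON to SpecificOffsets so topic names keep their case. Illustrative 
only, not necessarily the actual patch.

{code}
val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim) match {
    case Some(offsets) if offsets.equalsIgnoreCase("latest") => LatestOffsets
    case Some(offsets) if offsets.equalsIgnoreCase("earliest") => EarliestOffsets
    case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))  // case preserved
    case None => LatestOffsets
  }
{code}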



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets

2017-03-07 Thread Chris Bowden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bowden updated SPARK-19853:
-
Description: 
When using the KafkaSource with Structured Streaming, consumer assignments are 
not what the user expects if startingOffsets is set to an explicit set of 
topics/partitions in JSON where the topic(s) happen to have uppercase 
characters. When StartingOffsets is constructed, the original string value from 
options is transformed toLowerCase to make matching on "earliest" and "latest" 
case insensitive. However, the toLowerCase json is passed to SpecificOffsets 
for the terminal condition, so topic names may not be what the user intended by 
the time assignments are made with the underlying KafkaConsumer.

From KafkaSourceProvider:
{code}
val startingOffsets = 
caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) 
match {
case Some("latest") => LatestOffsets
case Some("earliest") => EarliestOffsets
case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
case None => LatestOffsets
  }
{code}
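
To make the failure mode concrete, here is a small self-contained Scala sketch (illustrative only, not Spark source; the topic name and offsets are invented) showing how lower-casing the offsets JSON silently rewrites the topic key before it ever reaches the KafkaConsumer:

{code}
object LowercaseTopicDemo {
  def main(args: Array[String]): Unit = {
    // JSON a user might pass for startingOffsets, with an uppercase topic name
    val userJson = """{"MyTopic":{"0":1234}}"""

    // What the current option handling effectively does before parsing
    val lowered = userJson.toLowerCase

    println(lowered)                     // {"mytopic":{"0":1234}} - the topic name has changed
    println(lowered.contains("MyTopic")) // false: assignments would target "mytopic", not "MyTopic"
  }
}
{code}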

  was:When using the KafkaSource with Structured Streaming, consumer 
assignments are not what the user expects if startingOffsets is set to an 
explicit set of topics/partitions in JSON where the topic(s) happen to have 
uppercase characters. When StartingOffsets is constructed, the original string 
value from options is transformed toLowerCase to make matching on "earliest" 
and "latest" case insensitive. However, the toLowerCase json is passed to 
SpecificOffsets for the terminal condition, so topic names may not be what the 
user intended by the time assignments are made with the underlying 
KafkaConsumer.


> Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
> 
>
> Key: SPARK-19853
> URL: https://issues.apache.org/jira/browse/SPARK-19853
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Chris Bowden
>Priority: Trivial
>
> When using the KafkaSource with Structured Streaming, consumer assignments 
> are not what the user expects if startingOffsets is set to an explicit set of 
> topics/partitions in JSON where the topic(s) happen to have uppercase 
> characters. When StartingOffsets is constructed, the original string value 
> from options is transformed toLowerCase to make matching on "earliest" and 
> "latest" case insensitive. However, the toLowerCase json is passed to 
> SpecificOffsets for the terminal condition, so topic names may not be what 
> the user intended by the time assignments are made with the underlying 
> KafkaConsumer.
> From KafkaSourceProvider:
> {code}
> val startingOffsets =
>   caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
>     case Some("latest") => LatestOffsets
>     case Some("earliest") => EarliestOffsets
>     case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
>     case None => LatestOffsets
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets

2017-03-07 Thread Chris Bowden (JIRA)
Chris Bowden created SPARK-19853:


 Summary: Uppercase Kafka topics fail when startingOffsets are 
SpecificOffsets
 Key: SPARK-19853
 URL: https://issues.apache.org/jira/browse/SPARK-19853
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Chris Bowden
Priority: Trivial


When using the KafkaSource with Structured Streaming, consumer assignments are 
not what the user expects if startingOffsets is set to an explicit set of 
topics/partitions in JSON where the topic(s) happen to have uppercase 
characters. When StartingOffsets is constructed, the original string value from 
options is transformed toLowerCase to make matching on "earliest" and "latest" 
case insensitive. However, the toLowerCase json is passed to SpecificOffsets 
for the terminal condition, so topic names may not be what the user intended by 
the time assignments are made with the underlying KafkaConsumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16207) order guarantees for DataFrames

2017-03-07 Thread Chris Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900220#comment-15900220
 ] 

Chris Rogers edited comment on SPARK-16207 at 3/7/17 9:52 PM:
--

[~srowen] since there is no documentation yet, I don't know whether a clear, 
coherent generalization can be made.  I would be happy with "most of the 
methods DO NOT preserve order, with these specific exceptions", or "most of the 
methods DO preserve order, with these specific exceptions".

Failing a generalization, I'd also be happy with method-by-method documentation 
of ordering semantics, which seems like a very minimal amount of copy-pasting 
("Preserves ordering: yes", "Preserves ordering: no").  Maybe that's a good 
place to start, since there seems to be some confusion about what the 
generalization would be.

I'm new to Scala, so I'm not sure whether this is practical, but maybe the appropriate 
methods could be moved to an `RDDPreservesOrdering` class with an implicit conversion, 
akin to `PairRDDFunctions`?
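
A very rough sketch of that idea (a purely hypothetical API, not anything that exists in Spark; the class and method names are invented here):

{code}
import scala.language.implicitConversions
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Hypothetical wrapper exposing only operations documented as order-preserving.
class RDDPreservesOrdering[T: ClassTag](self: RDD[T]) {
  // map keeps elements in their original order within and across partitions
  def mapPreservingOrder[U: ClassTag](f: T => U): RDD[U] = self.map(f)
}

object RDDPreservesOrdering {
  implicit def rddToOrderPreserving[T: ClassTag](rdd: RDD[T]): RDDPreservesOrdering[T] =
    new RDDPreservesOrdering(rdd)
}
{code}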


was (Author: rcrogers):
[~srowen] since there is no documentation yet, I don't know whether a clear, 
coherent generalization can be made.  I would be happy with "most of the 
methods DO NOT preserve order, with these specific exceptions", or "most of the 
methods DO preserve order, with these specific exceptions".

Failing a generalization, I'd also be happy with method-by-method documentation 
of ordering semantics, which seems like a very minimal amount of copy-pasting 
("Preserves ordering: yes", "Preserves ordering: no").  Maybe that's a good 
place to start, since there seems to be some confusion about what the 
generalization would be.

> order guarantees for DataFrames
> ---
>
> Key: SPARK-16207
> URL: https://issues.apache.org/jira/browse/SPARK-16207
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Max Moroz
>Priority: Minor
>
> There's no clear explanation in the documentation about what guarantees are 
> available for the preservation of order in DataFrames. Different blogs, SO 
> answers, and posts on course websites suggest different things. It would be 
> good to provide clarity on this.
> Examples of questions on which I could not find clarification:
> 1) Does groupby() preserve order?
> 2) Does take() preserve order?
> 3) Is DataFrame guaranteed to have the same order of lines as the text file 
> it was read from? (Or as the json file, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16207) order guarantees for DataFrames

2017-03-07 Thread Chris Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900220#comment-15900220
 ] 

Chris Rogers commented on SPARK-16207:
--

[~srowen] since there is no documentation yet, I don't know whether a clear, 
coherent generalization can be made.  I would be happy with "most of the 
methods DO NOT preserve order, with these specific exceptions", or "most of the 
methods DO preserve order, with these specific exceptions".

Failing a generalization, I'd also be happy with method-by-method documentation 
of ordering semantics, which seems like a very minimal amount of copy-pasting 
("Preserves ordering: yes", "Preserves ordering: no").  Maybe that's a good 
place to start, since there seems to be some confusion about what the 
generalization would be.

> order guarantees for DataFrames
> ---
>
> Key: SPARK-16207
> URL: https://issues.apache.org/jira/browse/SPARK-16207
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Max Moroz
>Priority: Minor
>
> There's no clear explanation in the documentation about what guarantees are 
> available for the preservation of order in DataFrames. Different blogs, SO 
> answers, and posts on course websites suggest different things. It would be 
> good to provide clarity on this.
> Examples of questions on which I could not find clarification:
> 1) Does groupby() preserve order?
> 2) Does take() preserve order?
> 3) Is DataFrame guaranteed to have the same order of lines as the text file 
> it was read from? (Or as the json file, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-07 Thread Nick Afshartous (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900216#comment-15900216
 ] 

Nick Afshartous commented on SPARK-19767:
-

Missed that one, thanks.  

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> lead to API pages that do not have class ConsumerStrategies.  The API doc 
> package names also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19702) Increase refuse_seconds timeout in the Mesos Spark Dispatcher

2017-03-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19702:
-

 Assignee: Michael Gummelt
 Priority: Minor  (was: Major)
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/17031

> Increase refuse_seconds timeout in the Mesos Spark Dispatcher
> --
>
> Key: SPARK-19702
> URL: https://issues.apache.org/jira/browse/SPARK-19702
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
>Priority: Minor
> Fix For: 2.2.0
>
>
> Due to the problem described here: 
> https://issues.apache.org/jira/browse/MESOS-6112, running more than 5 Mesos 
> frameworks concurrently can result in starvation.  For example, running 10 
> dispatchers could result in 5 of them getting all the offers, even if they 
> have no jobs to launch.  We must increase the refuse_seconds timeout to solve 
> this problem.  Another option would have been to implement suppress/revive, 
> but that can cause starvation due to the unreliability of Mesos RPC calls.
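
For context, a minimal sketch of the mechanism being tuned (illustrative only, not the dispatcher's actual code; the object name, method name, and the 120-second value are invented here). A framework that declines an offer can attach a refuse_seconds filter so Mesos does not immediately re-offer the same resources to it:

{code}
import org.apache.mesos.Protos.{Filters, OfferID}
import org.apache.mesos.SchedulerDriver

object RefuseSecondsSketch {
  // Decline an offer we have no work for, and ask Mesos not to re-offer
  // these resources to this framework for a while.
  def declineForAWhile(driver: SchedulerDriver, offerId: OfferID): Unit = {
    val filters = Filters.newBuilder()
      .setRefuseSeconds(120.0) // illustrative value; the ticket argues for a larger timeout
      .build()
    driver.declineOffer(offerId, filters)
  }
}
{code}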



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900207#comment-15900207
 ] 

Sean Owen commented on SPARK-19767:
---

Oh, are you not running from the {{docs/}} directory?

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> lead to API pages that do not have class ConsumerStrategies.  The API doc 
> package names also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-07 Thread Nick Afshartous (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900184#comment-15900184
 ] 

Nick Afshartous commented on SPARK-19767:
-

Yes, I completed the steps in the Prerequisites section of 
https://github.com/apache/spark/blob/master/docs/README.md 
and got the same error about the unknown tag {{include_example}} on two different 
computers (Linux and OS X).

It looks like {{include_example}} is defined locally in 
{{./docs/_plugins/include_example.rb}}, so maybe this is some kind of path 
issue where it's not finding the local file?

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> lead to API pages that do not have class ConsumerStrategies.  The API doc 
> package names also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch

2017-03-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19561.

   Resolution: Fixed
 Assignee: Jason White
Fix Version/s: 2.2.0
   2.1.1

> Pyspark Dataframes don't allow timestamps near epoch
> 
>
> Key: SPARK-19561
> URL: https://issues.apache.org/jira/browse/SPARK-19561
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 2.1.1, 2.2.0
>
>
> Pyspark does not allow timestamps at or near the epoch to be created in a 
> DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299
> TimestampType.toInternal converts a datetime object to a number representing 
> microseconds since the epoch. For all times more than 2148 seconds before or 
> after 1970-01-01T00:00:00+00:00, this number is greater than 2^31 and Py4J 
> automatically serializes it as a long.
> However, for times within this range (~35 minutes before or after the epoch), 
> Py4J serializes it as an int. When creating the object on the Scala side, 
> ints are not recognized and the value goes to null. This leads to null values 
> in non-nullable fields, and corrupted Parquet files.
> The solution is trivial - force TimestampType.toInternal to always return a 
> long.
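
As a quick sanity check on the numbers above (a standalone arithmetic sketch, not Spark or Py4J code): 2^31 microseconds is roughly 2147 seconds, i.e. about 35.8 minutes on either side of the epoch.

{code}
object EpochWindow {
  def main(args: Array[String]): Unit = {
    val thresholdMicros = Int.MaxValue.toLong + 1             // 2^31 microseconds
    val seconds         = thresholdMicros / 1000000L          // 2147 seconds
    val minutes         = thresholdMicros / 1000000.0 / 60.0  // ~35.8 minutes
    println(s"2^31 microseconds = $seconds seconds, about $minutes minutes")
  }
}
{code}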



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16207) order guarantees for DataFrames

2017-03-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900162#comment-15900162
 ] 

Sean Owen commented on SPARK-16207:
---

[~rcrogers] Where would you document this? We could add a document in a place 
that would have helped you. However, as I say, lots of things don't preserve 
order, so I don't know if it's sensible to write that everywhere.

> order guarantees for DataFrames
> ---
>
> Key: SPARK-16207
> URL: https://issues.apache.org/jira/browse/SPARK-16207
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Max Moroz
>Priority: Minor
>
> There's no clear explanation in the documentation about what guarantees are 
> available for the preservation of order in DataFrames. Different blogs, SO 
> answers, and posts on course websites suggest different things. It would be 
> good to provide clarity on this.
> Examples of questions on which I could not find clarification:
> 1) Does groupby() preserve order?
> 2) Does take() preserve order?
> 3) Is DataFrame guaranteed to have the same order of lines as the text file 
> it was read from? (Or as the json file, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16207) order guarantees for DataFrames

2017-03-07 Thread Chris Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900152#comment-15900152
 ] 

Chris Rogers commented on SPARK-16207:
--

The lack of documentation on this is immensely confusing.

> order guarantees for DataFrames
> ---
>
> Key: SPARK-16207
> URL: https://issues.apache.org/jira/browse/SPARK-16207
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Max Moroz
>Priority: Minor
>
> There's no clear explanation in the documentation about what guarantees are 
> available for the preservation of order in DataFrames. Different blogs, SO 
> answers, and posts on course websites suggest different things. It would be 
> good to provide clarity on this.
> Examples of questions on which I could not find clarification:
> 1) Does groupby() preserve order?
> 2) Does take() preserve order?
> 3) Is DataFrame guaranteed to have the same order of lines as the text file 
> it was read from? (Or as the json file, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-07 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900127#comment-15900127
 ] 

Shixiong Zhu commented on SPARK-19764:
--

So you don't set an UncaughtExceptionHandler and this OOM happened on the 
driver side? If so, then it's not a Spark bug.

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors end up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19851:
-
Component/s: (was: Spark Core)

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19803:
---
Labels: flaky-test  (was: )

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19803:
---
Affects Version/s: (was: 2.3.0)
   2.2.0

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19803:
---
Component/s: Tests

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19803.

   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0

Thanks for fixing this [~uncleGen] and for reporting it [~sitalke...@gmail.com]

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19516) update public doc to use SparkSession instead of SparkContext

2017-03-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19516.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16856
[https://github.com/apache/spark/pull/16856]

> update public doc to use SparkSession instead of SparkContext
> -
>
> Key: SPARK-19516
> URL: https://issues.apache.org/jira/browse/SPARK-19516
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-07 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1599#comment-1599
 ] 

Ari Gesher commented on SPARK-19764:


We were collecting more data than we had heap for. Still useful?

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors end up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-07 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19852:
-

 Summary: StringIndexer.setHandleInvalid should have another option 
'new': Python API and docs
 Key: SPARK-19852
 URL: https://issues.apache.org/jira/browse/SPARK-19852
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Priority: Minor


Update the Python API for StringIndexer so the setHandleInvalid doc is correct.  This 
will probably require:
* putting HandleInvalid within StringIndexer to update its built-in doc (See 
Bucketizer for an example.)
* updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-03-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17498.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16883
[https://github.com/apache/spark/pull/16883]

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>Assignee: Vincent
>Priority: Minor
> Fix For: 2.2.0
>
>
> That will map an unseen label to the maximum known label + 1; IndexToString would 
> map that back to "" or NA, if there is something like that in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-19851:
---
Description: 
Add support for EVERY and ANY (SOME) aggregates.

- EVERY returns true if all input values are true.
- ANY returns true if at least one input value is true.
- SOME is equivalent to ANY.

Both aggregates are part of the SQL standard.
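
Until such built-ins exist, the semantics can be approximated with existing aggregates. A minimal sketch (assuming a local SparkSession; the table and column names are invented) that emulates EVERY and ANY by aggregating the boolean column cast to int:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

object EveryAnyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("every-any").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", true), ("a", false), ("b", true)).toDF("k", "v")

    df.groupBy($"k")
      .agg(
        (min($"v".cast("int")) === 1).as("every_v"), // EVERY: true only if all values are true
        (max($"v".cast("int")) === 1).as("any_v")    // ANY / SOME: true if at least one value is true
      )
      .show()

    spark.stop()
  }
}
{code}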

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899988#comment-15899988
 ] 

Michael Styles commented on SPARK-19851:


https://github.com/apache/spark/pull/17194

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899979#comment-15899979
 ] 

Apache Spark commented on SPARK-19348:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/17193

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> ---
>
> Key: SPARK-19348
> URL: https://issues.apache.org/jira/browse/SPARK-19348
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Vinayak Joshi
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
> Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate 
> Python threads, it is observed that the stages used to construct a pipeline 
> object get corrupted, i.e. the stages supplied to a Pipeline object in one 
> thread appear inside a different Pipeline object constructed in a different 
> thread. 
> Things work fine if construction of pyspark.ml.Pipeline objects is 
> serialized, so this looks like a thread-safety problem with 
> pyspark.ml.Pipeline object construction. 
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know 
> whether other pipeline operations, such as pyspark.ml.pipeline.fit(), are 
> also affected by the underlying cause of this problem. That is, whether 
> other pipeline operations like pyspark.ml.pipeline.fit() may be performed 
> in separate threads (on distinct pipeline objects) concurrently without any 
> cross-contamination between them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-07 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899980#comment-15899980
 ] 

Shixiong Zhu commented on SPARK-19764:
--

[~agesher] Do you have the OOM stack trace? So that we can fix it.

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors end up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Michael Styles (JIRA)
Michael Styles created SPARK-19851:
--

 Summary: Add support for EVERY and ANY (SOME) aggregates
 Key: SPARK-19851
 URL: https://issues.apache.org/jira/browse/SPARK-19851
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Michael Styles






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-07 Thread Ari Gesher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Gesher resolved SPARK-19764.

Resolution: Not A Bug

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors end up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-07 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899964#comment-15899964
 ] 

Ari Gesher commented on SPARK-19764:


We narrowed this down to a driver OOM that wasn't being properly propagated into 
our Jupyter Notebook.

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors end up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18549.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.2.0

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> -- So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table

2017-03-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19765.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> UNCACHE TABLE should also un-cache all cached plans that refer to this table
> 
>
> Key: SPARK-19765
> URL: https://issues.apache.org/jira/browse/SPARK-19765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>  Labels: release_notes
> Fix For: 2.2.0
>
>
> DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, 
> UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all 
> the cached plans that refer to this table
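
A minimal sketch of the intended behavior (illustrative only; the table names are invented, and a local SparkSession is assumed):

{code}
import org.apache.spark.sql.SparkSession

object UncacheCascadeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("uncache-cascade").getOrCreate()

    spark.range(10).write.saveAsTable("base_tbl")
    spark.sql("CACHE TABLE base_tbl")
    spark.sql("CACHE TABLE doubled AS SELECT id * 2 AS x FROM base_tbl")

    // `doubled` is a cached plan that refers to base_tbl. With this change,
    // un-caching base_tbl should also evict the dependent cached plan rather
    // than leaving a stale entry behind; the same applies to dropping,
    // truncating, renaming, or refreshing base_tbl.
    spark.sql("UNCACHE TABLE base_tbl")

    spark.stop()
  }
}
{code}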



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


