[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900871#comment-15900871 ] Cui Xixin commented on SPARK-6951: -- In my case, the .inprogress files are the main reason for the slow startup. Are the .inprogress files actually needed by the history server, and if not, can they be cleaned up by 'spark.history.fs.cleaner.maxAge'? > History server slow startup if the event log directory is large > --- > > Key: SPARK-6951 > URL: https://issues.apache.org/jira/browse/SPARK-6951 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Matt Cheah > > I started my history server, then navigated to the web UI where I expected to > be able to view some completed applications, but the webpage was not > available. It turned out that the History Server was not finished parsing all > of the event logs in the event log directory that I had specified. I had > accumulated a lot of event logs from months of running Spark, so it would > have taken a very long time for the History Server to crunch through them > all. I purged the event log directory and started from scratch, and the UI > loaded immediately. > We should have a pagination strategy or parse the directory lazily to avoid > needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900821#comment-15900821 ] Saisai Shao edited comment on SPARK-19812 at 3/8/17 7:42 AM: - [~tgraves], I'm not quite sure what you mean here? bq. The tests are using files rather then directories so it didn't catch. We need to fix the test also. >From my understanding this issues happens when dest dir is not empty and try >to move with REPLACE_EXISTING. Also be happened when calling rename failed and >the source dir is not empty directory. But I cannot imagine how this happened, from the log it is more like a `rename` failure issue, since the path in Exception points to source dir. was (Author: jerryshao): [~tgraves], I'm not quite sure what you mean here? bq. The tests are using files rather then directories so it didn't catch. We need to fix the test also. >From my understanding this issues happens when dest dir is not empty and try >to move with REPLACE_EXISTING. Also be happened when calling rename failed and >the source dir is not empty directory. But I cannot imagine how this happened, because if dest dir is not empty, then it should be returned before, will not go to check old NM local dirs. > YARN shuffle service fails to relocate recovery DB directories > -- > > Key: SPARK-19812 > URL: https://issues.apache.org/jira/browse/SPARK-19812 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.1 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The yarn shuffle service tries to switch from the yarn local directories to > the real recovery directory but can fail to move the existing recovery db's. > It fails due to Files.move not doing directories that have contents. > 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move > recovery file sparkShuffleRecovery.ldb to the path > /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle > java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb > at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) > at > sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) > at java.nio.file.Files.move(Files.java:1395) > at > org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) > at > org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) > This 
used to use f.renameTo and we switched it in the pr due to review > comments and it looks like didn't do a final real test. The tests are using > files rather then directories so it didn't catch. We need to fix the test > also. > history: > https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900821#comment-15900821 ] Saisai Shao edited comment on SPARK-19812 at 3/8/17 7:40 AM: - [~tgraves], I'm not quite sure what you mean here? bq. The tests are using files rather then directories so it didn't catch. We need to fix the test also. >From my understanding this issues happens when dest dir is not empty and try >to move with REPLACE_EXISTING. Also be happened when calling rename failed and >the source dir is not empty directory. But I cannot imagine how this happened, because if dest dir is not empty, then it should be returned before, will not go to check old NM local dirs. was (Author: jerryshao): [~tgraves], I'm not quite sure what you mean here? bq. The tests are using files rather then directories so it didn't catch. We need to fix the test also. >From my understanding this issues happens when dest dir is not empty and try >to move with REPLACE_EXISTING, but I cannot imagine how this happened, because >if dest dir is not empty, then it should be returned before, will not go to >check old NM local dirs. > YARN shuffle service fails to relocate recovery DB directories > -- > > Key: SPARK-19812 > URL: https://issues.apache.org/jira/browse/SPARK-19812 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.1 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The yarn shuffle service tries to switch from the yarn local directories to > the real recovery directory but can fail to move the existing recovery db's. > It fails due to Files.move not doing directories that have contents. > 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move > recovery file sparkShuffleRecovery.ldb to the path > /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle > java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb > at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) > at > sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) > at java.nio.file.Files.move(Files.java:1395) > at > org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) > at > org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) > This used to use f.renameTo and we switched it in the pr due to review > comments and it 
looks like didn't do a final real test. The tests are using > files rather then directories so it didn't catch. We need to fix the test > also. > history: > https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle
[ https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900825#comment-15900825 ] Nick Pentreath commented on SPARK-13969: I think {{HashingTF}} and {{FeatureHasher}} are different things - similar to HashingVectorizer and FeatureHasher in scikit-learn. {{HashingTF}} (HashingVectorizer) transforms a Seq of terms (typically {{String}}) into a term frequency vector. Yes technically it can operate on Seq of any type (well actually only strings and numbers, see [murmur3Hash function|https://github.com/apache/spark/blob/60022bfd65e4637efc0eb5f4cc0112289c783147/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala#L151]). It could certainly operate on multiple columns - that would hash all the columns of sentences into one term frequency vector, so it seems like it would probably be less used in practice (though Vowpal Wabbit supports a form of this with its namespaces). What {{HashingTF}} does not support is arbitrary categorical or numeric columns. It is possible to support categorical "one-hot" style encoding using what I have come to call the "stringify hack" - transforming a set of categorical columns into a Seq for input to HashingTF. So taking say two categorical columns {{city}} and {{state}}, for example:
{code}
+--------+-----+-------------------------+
|city    |state|stringified              |
+--------+-----+-------------------------+
|Boston  |MA   |[city=Boston, state=MA]  |
|New York|NY   |[city=New York, state=NY]|
+--------+-----+-------------------------+
{code}
This works but is pretty ugly, doesn't fit nicely into a pipeline, and can't support numeric columns. The {{FeatureHasher}} I propose acts like that in scikit-learn - it can handle multiple numeric and/or categorical columns in one pass. I go into some detail about all of this in my [Spark Summit East 2017 talk|https://www.slideshare.net/SparkSummit/feature-hashing-for-scalable-machine-learning-spark-summit-east-talk-by-nick-pentreath]. The rough draft of it used for the talk is [here|https://github.com/MLnick/spark/blob/FeatureHasher/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala]. Another nice thing about the {{FeatureHasher}} is it opens up possibilities for doing things like namespaces in Vowpal Wabbit and it would be interesting to see if we could mimic their internal feature crossing, and so on. > Extend input format that feature hashing can handle > --- > > Key: SPARK-13969 > URL: https://issues.apache.org/jira/browse/SPARK-13969 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > > Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in > scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of > strings and computes term frequencies. > The use cases for feature hashing extend to arbitrary feature values (binary, > count or real-valued). For example, scikit-learn's {{FeatureHasher}} can > accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this > way, feature hashing can operate as both "one-hot encoder" and "vector > assembler" at the same time. > Investigate adding a more generic feature hasher (that in turn can be used by > {{HashingTF}}). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
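The "stringify hack" above can be sketched end to end with the existing {{HashingTF}}. This is only an illustration of the workaround described in the comment, not the proposed {{FeatureHasher}}; the column names and feature dimension are made up for the example.
{code}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, concat, lit}

// Fold categorical columns into a Seq[String] of "name=value" tokens so the
// existing HashingTF can hash them "one-hot" style (the stringify hack).
val spark = SparkSession.builder().appName("stringify-hack").getOrCreate()
import spark.implicits._

val df = Seq(("Boston", "MA"), ("New York", "NY")).toDF("city", "state")

val stringified = df.withColumn(
  "stringified",
  array(concat(lit("city="), $"city"), concat(lit("state="), $"state")))

val hashed = new HashingTF()
  .setInputCol("stringified")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)
  .transform(stringified)

hashed.select("city", "state", "features").show(truncate = false)
{code}
As the comment notes, this forces every value through string concatenation and cannot carry real-valued features, which is exactly the gap a dedicated {{FeatureHasher}} would close.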
[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900824#comment-15900824 ] Takeshi Yamamuro commented on SPARK-15463: -- Have you seen https://github.com/apache/spark/pull/13300#issuecomment-261156734 as related discussion? Currently, I think [~hyukjin.kwon]'s idea is more preferable: https://github.com/apache/spark/pull/16854#issue-206224691. > Support for creating a dataframe from CSV in Dataset[String] > > > Key: SPARK-15463 > URL: https://issues.apache.org/jira/browse/SPARK-15463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: PJ Fanning > > I currently use Databrick's spark-csv lib but some features don't work with > Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV > support into spark-sql directly, that spark-csv won't be modified. > I currently read some CSV data that has been pre-processed and is in > RDD[String] format. > There is sqlContext.read.json(rdd: RDD[String]) but other formats don't > appear to support the creation of DataFrames based on loading from > RDD[String]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
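For reference, the direction discussed in those PRs would let the pre-processed lines be consumed directly. The sketch below assumes the {{DataFrameReader.csv(Dataset[String])}} overload from the linked discussion is available; until that lands it should be read as pseudocode for the proposed API rather than a shipped method.
{code}
import org.apache.spark.sql.{Dataset, SparkSession}

// Read CSV that has already been pre-processed into a Dataset[String],
// without writing it back to storage first.
val spark = SparkSession.builder().appName("csv-from-dataset").getOrCreate()
import spark.implicits._

val csvLines: Dataset[String] = Seq("id,name", "1,alice", "2,bob").toDS()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvLines)   // overload proposed in the linked PR discussion

df.show()
{code}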
[jira] [Commented] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900821#comment-15900821 ] Saisai Shao commented on SPARK-19812: - [~tgraves], I'm not quite sure what you mean here. bq. The tests are using files rather then directories so it didn't catch. We need to fix the test also. From my understanding, this issue happens when the dest dir is not empty and we try to move with REPLACE_EXISTING, but I cannot see how that can happen here: if the dest dir were not empty, we would have returned earlier and never gone on to check the old NM local dirs. > YARN shuffle service fails to relocate recovery DB directories > -- > > Key: SPARK-19812 > URL: https://issues.apache.org/jira/browse/SPARK-19812 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.1 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The yarn shuffle service tries to switch from the yarn local directories to > the real recovery directory but can fail to move the existing recovery db's. > It fails due to Files.move not doing directories that have contents. > 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move > recovery file sparkShuffleRecovery.ldb to the path > /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle > java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb > at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) > at > sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) > at java.nio.file.Files.move(Files.java:1395) > at > org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) > at > org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) > This used to use f.renameTo and we switched it in the pr due to review > comments and it looks like didn't do a final real test. The tests are using > files rather then directories so it didn't catch. We need to fix the test > also. > history: > https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
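For context on the failure above: {{java.nio.file.Files.move}} only succeeds on a directory when it can be renamed in place; when a rename is not possible and the directory has contents, it throws {{DirectoryNotEmptyException}}, which is what the stack trace shows. The sketch below is not the Spark patch, just an illustration of a copy-then-delete fallback; the method name is made up.
{code}
import java.nio.file.{DirectoryNotEmptyException, Files, Path, StandardCopyOption}
import scala.collection.JavaConverters._

// Try a plain move first; if the directory cannot simply be renamed and is
// not empty, fall back to copying the tree and then deleting the source.
def moveRecoveryDir(src: Path, dst: Path): Unit = {
  try {
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING)
  } catch {
    case _: DirectoryNotEmptyException =>
      val entries = Files.walk(src).iterator().asScala.toList
      entries.foreach { p =>
        val target = dst.resolve(src.relativize(p))
        if (Files.isDirectory(p)) Files.createDirectories(target)
        else Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING)
      }
      // Reversed pre-order listing deletes children before their parents.
      entries.reverse.foreach(p => Files.deleteIfExists(p))
  }
}
{code}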
[jira] [Commented] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900773#comment-15900773 ] Apache Spark commented on SPARK-19843: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/17205 > UTF8String => (int / long) conversion expensive for invalid inputs > -- > > Key: SPARK-19843 > URL: https://issues.apache.org/jira/browse/SPARK-19843 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.2.0 > > > In case of invalid inputs, converting a UTF8String to int or long returns > null. This comes at a cost wherein the method for conversion (e.g [0]) would > throw an exception. Exception handling is expensive as it will convert the > UTF8String into a java string, populate the stack trace (which is a native > call). While migrating workloads from Hive -> Spark, I see that this at an > aggregate level affects the performance of queries in comparison with hive. > The exception is just indicating that the conversion failed.. its not > propagated to users so it would be good to avoid. > Couple of options: > - Return Integer / Long (instead of primitive types) which can be set to NULL > if the conversion fails. This is boxing and super bad for perf so a big no. > - Hive has a pre-check [1] for this which is not a perfect safety net but > helpful to capture typical bad inputs eg. empty string, "null". > [0] : > https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950 > [1] : > https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
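To make the second option above concrete, here is a rough sketch of a Hive-style pre-check (the names and exact rules are illustrative, not the merged change): a cheap byte scan rejects obviously invalid input before attempting a parse that would otherwise throw.
{code}
// Cheap scan: optional sign followed by at least one digit. Like Hive's
// check, this is not a perfect safety net (overflow can still throw), but it
// filters typical bad inputs such as "" or "null" without any exception.
def looksLikeIntegral(bytes: Array[Byte]): Boolean = {
  if (bytes.isEmpty) return false
  var i = if (bytes(0) == '+' || bytes(0) == '-') 1 else 0
  if (i == bytes.length) return false
  while (i < bytes.length) {
    if (bytes(i) < '0' || bytes(i) > '9') return false
    i += 1
  }
  true
}

// Usage sketch: only fall through to the exception-throwing parse when the
// cheap check passes; real code would set a null flag rather than box.
def toLongOption(s: String): Option[Long] =
  if (looksLikeIntegral(s.getBytes("UTF-8"))) scala.util.Try(java.lang.Long.parseLong(s)).toOption
  else None
{code}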
[jira] [Comment Edited] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900741#comment-15900741 ] jin xing edited comment on SPARK-19659 at 3/8/17 5:47 AM: -- [~irashid] [~rxin] I uploaded SPARK-19659-design-v2.pdf, please take a look. Yes, only outliers should be tracked and MapStatus should stay compact. It is a good idea to track all the sizes that are more than 2x the average. “2x” is a default value, it will be configurable by parameter. We need to collect metrics for stability and performance: - For stability: 1) Blocks which have size between the average and 2x the average are underestimated. So there is a risk that those blocks cannot fit in memory. In this approach, driver should calculate the sum of underestimated sizes. 2) Show block sizes’ distribution of MapStatus, the distribution will between(0~100K, 100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which will help a lot for debugging(e.g. find skew situations). * For performance: 1) Peak memory(off heap or on heap) used for fetching blocks should be tracked. 2) Fetching blocks to disk will cause performance degradation. Executor should calculate the size of blocks shuffled to disk and the time cost. Yes, Imran, I will definitely break the proposal to smaller pieces(jiras and prs). The metrics part should be done first before other parts proposed here. What's more, why not make 2000 as a configurable parameter, thus user can chose the track the accurate sizes of blocks to some level. was (Author: jinxing6...@126.com): [~irashid] [~rxin] I uploaded SPARK-19659-design-v2.pdf, please take a look. Yes, only outliers should be tracked and MapStatus should stay compact. It is a good idea to track all the sizes that are more than 2x the average. “2x” is a default value, it will be configurable by parameter. We need to collect metrics for stability and performance: - For stability: 1) Blocks which have size between the average and 2x the average are underestimated. So there is a risk that those blocks cannot fit in memory. In this approach, driver should calculate the sum of underestimated sizes. 2) Show block sizes’ distribution of MapStatus, the distribution will between(0~100K, 100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which will help a lot for debugging(e.g. find skew situations). * For performance: 1) Memory(off heap or on heap) used for fetching blocks should be tracked. 2) Fetching blocks to disk will cause performance degradation. Executor should calculate the size of blocks shuffled to disk and the time cost. Yes, Imran, I will definitely break the proposal to smaller pieces(jiras and prs). The metrics part should be done first before other parts proposed here. What's more, why not make 2000 as a configurable parameter, thus user can chose the track the accurate sizes of blocks to some level. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". 
Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900741#comment-15900741 ] jin xing commented on SPARK-19659: -- [~irashid] [~rxin] I uploaded SPARK-19659-design-v2.pdf, please take a look. Yes, only outliers should be tracked and MapStatus should stay compact. It is a good idea to track all the sizes that are more than 2x the average. “2x” is a default value and will be configurable by a parameter. We need to collect metrics for stability and performance: - For stability: 1) Blocks whose size is between the average and 2x the average are underestimated, so there is a risk that those blocks cannot fit in memory; in this approach the driver should calculate the sum of the underestimated sizes. 2) Show the distribution of block sizes in MapStatus, bucketed into (0~100K, 100K~1M, 1~10M, 10~100M, 100M~1G, 1G~10G), which will help a lot for debugging (e.g. finding skew situations). - For performance: 1) Memory (off heap or on heap) used for fetching blocks should be tracked. 2) Fetching blocks to disk will cause performance degradation, so the executor should calculate the size of blocks shuffled to disk and the time cost. Yes, Imran, I will definitely break the proposal into smaller pieces (JIRAs and PRs). The metrics part should be done first, before the other parts proposed here. What's more, why not make 2000 a configurable parameter, so that users can choose to track accurate block sizes up to some level. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
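To illustrate the idea in the comment above, here is a toy sketch of a map status that stores exact sizes only for outliers above a configurable multiple of the average. This is not Spark's {{HighlyCompressedMapStatus}}; the class and member names are made up.
{code}
// Compact per-map status: one average for "normal" blocks, exact sizes only
// for blocks larger than `factor` times the average (default 2x).
class OutlierTrackingMapStatus(sizes: Array[Long], factor: Double = 2.0) {
  private val nonEmpty = sizes.filter(_ > 0)
  private val avg: Long = if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
  private val outliers: Map[Int, Long] = sizes.zipWithIndex.collect {
    case (s, i) if avg > 0 && s > factor * avg => i -> s
  }.toMap

  /** Exact size for tracked outliers, the average for everything else. */
  def getSizeForBlock(reduceId: Int): Long =
    outliers.getOrElse(reduceId, if (sizes(reduceId) > 0) avg else 0L)

  /** Stability metric: total bytes under-estimated for blocks between avg and factor*avg. */
  def underestimatedBytes: Long = sizes.indices.collect {
    case i if sizes(i) > avg && !outliers.contains(i) => sizes(i) - avg
  }.sum
}
{code}
The {{underestimatedBytes}} value corresponds to the stability metric in point 1) of the comment.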
[jira] [Updated] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-19659: - Attachment: SPARK-19659-design-v2.pdf > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.
[ https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuchang updated SPARK-19860: Description: {code} >>> print df1.collect() [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)] >>> print df2.collect() [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)] >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
Traceback (most recent call last): File "", line 1, in File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join jdf = self._jdf.join(other._jdf, on._jc, how) File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u" Failure when resolving conflicting references in Join: 'Join Inner, (fdate#42 = fdate#42) :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97] : +- Filter (inorout#44 = A) : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42] :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) : +- SubqueryAlias history_transfer_v : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51] : +- SubqueryAlias history_transfer :+- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount2#145] +- Filter (inorout#44 = B) +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42] +- Filter (((partnerid#45 = pmec) && NOT (f
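The warning in the log ("Perhaps you need to use aliases") points at the usual workaround when both frames are derived from the same parent: alias each side so the join condition refers to two distinct attribute sets. Below is a minimal Scala sketch; the DataFrames are stand-ins for the report's df1/df2, and the same pattern works with {{DataFrame.alias}} in PySpark.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("alias-join").getOrCreate()
import spark.implicits._

val df1 = Seq(("20170223", 7758588), ("20170302", 7656414)).toDF("fdate", "in_amount1")
val df2 = Seq(("20170223", 7072120), ("20170302", 5548515)).toDF("fdate", "in_amount2")

// Aliasing disambiguates fdate on the left from fdate on the right,
// avoiding the "conflicting references" failure in self-join-like plans.
val joined = df1.alias("a")
  .join(df2.alias("b"), col("a.fdate") === col("b.fdate"), "inner")
  .select(col("a.fdate"), col("a.in_amount1"), col("b.in_amount2"))

joined.show()
{code}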
[jira] [Resolved] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19348. --- Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 17195 [https://github.com/apache/spark/pull/17195] > pyspark.ml.Pipeline gets corrupted under multi threaded use > --- > > Key: SPARK-19348 > URL: https://issues.apache.org/jira/browse/SPARK-19348 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Vinayak Joshi >Assignee: Bryan Cutler > Fix For: 2.1.1, 2.0.3, 2.2.0 > > Attachments: pyspark_pipeline_threads.py > > > When pyspark.ml.Pipeline objects are constructed concurrently in separate > python threads, it is observed that the stages used to construct a pipeline > object get corrupted i.e the stages supplied to a Pipeline object in one > thread appear inside a different Pipeline object constructed in a different > thread. > Things work fine if construction of pyspark.ml.Pipeline objects is > serialized, so this looks like a thread safety problem with > pyspark.ml.Pipeline object construction. > Confirmed that the problem exists with Spark 1.6.x as well as 2.x. > While the corruption of the Pipeline stages is easily caught, we need to know > if performing other pipeline operations, such as pyspark.ml.pipeline.fit( ) > are also affected by the underlying cause of this problem. That is, whether > other pipeline operations like pyspark.ml.pipeline.fit( ) may be performed > in separate threads (on distinct pipeline objects) concurrently without any > cross contamination between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API
[ https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19866: -- Shepherd: Joseph K. Bradley > Add local version of Word2Vec findSynonyms for spark.ml: Python API > --- > > Key: SPARK-19866 > URL: https://issues.apache.org/jira/browse/SPARK-19866 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Add Python API for findSynonymsArray matching Scala API in linked JIRA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API
Joseph K. Bradley created SPARK-19866: - Summary: Add local version of Word2Vec findSynonyms for spark.ml: Python API Key: SPARK-19866 URL: https://issues.apache.org/jira/browse/SPARK-19866 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Priority: Minor Add Python API for findSynonymsArray matching Scala API in linked JIRA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17629) Add local version of Word2Vec findSynonyms for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-17629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17629. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16811 [https://github.com/apache/spark/pull/16811] > Add local version of Word2Vec findSynonyms for spark.ml > --- > > Key: SPARK-17629 > URL: https://issues.apache.org/jira/browse/SPARK-17629 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Asher Krim >Assignee: Asher Krim >Priority: Minor > Fix For: 2.2.0 > > > ml Word2Vec's findSynonyms methods depart from mllib in that they return > distributed results, rather than the results directly: > {code} > def findSynonyms(word: String, num: Int): DataFrame = { > val spark = SparkSession.builder().getOrCreate() > spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", > "similarity") > } > {code} > What was the reason for this decision? I would think that most users would > request a reasonably small number of results back, and want to use them > directly on the driver, similar to the _take_ method on dataframes. Returning > parallelized results creates a costly round trip for the data that doesn't > seem necessary. > The original PR: https://github.com/apache/spark/pull/7263 > [~MechCoder] - do you perhaps recall the reason? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
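For what the "local" variant amounts to, a minimal sketch follows, assuming access to the underlying {{mllib.feature.Word2VecModel}} (which the ml wrapper keeps private); it illustrates the idea rather than the signature that was merged.
{code}
import org.apache.spark.mllib.feature.Word2VecModel

// Return synonyms directly to the driver as an Array instead of wrapping the
// small result in a distributed DataFrame.
def findSynonymsLocal(model: Word2VecModel, word: String, num: Int): Array[(String, Double)] =
  model.findSynonyms(word, num)
{code}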
[jira] [Resolved] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19859. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
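A small sketch of the behavior this fix targets (column names are illustrative; this is not a test from the patch):
{code}
import org.apache.spark.sql.DataFrame

// When withWatermark is called twice, the later definition should override
// the earlier one instead of the first-seen watermark column winning.
def withOverridingWatermark(events: DataFrame): DataFrame =
  events
    .withWatermark("createTime", "10 minutes")  // earlier watermark
    .withWatermark("eventTime", "5 seconds")    // should take effect after the fix
{code}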
[jira] [Resolved] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
[ https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19841. -- Resolution: Fixed Fix Version/s: 2.2.0 > StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys > > > Key: SPARK-19841 > URL: https://issues.apache.org/jira/browse/SPARK-19841 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Right now it just uses the rows to filter but a column position in > keyExpressions may be different than the position in row. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19865) remove the view identifier in SubqueryAlias
Wenchen Fan created SPARK-19865: --- Summary: remove the view identifier in SubqueryAlias Key: SPARK-19865 URL: https://issues.apache.org/jira/browse/SPARK-19865 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Jiang Xingbo Since we have a `View` node now, we can remove the view identifier in `SubqueryAlias`, which was used to indicate a view node before -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18389) Disallow cyclic view reference
[ https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18389: --- Assignee: Jiang Xingbo > Disallow cyclic view reference > -- > > Key: SPARK-18389 > URL: https://issues.apache.org/jira/browse/SPARK-18389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > The following should not be allowed: > {code} > CREATE VIEW testView AS SELECT id FROM jt > CREATE VIEW testView2 AS SELECT id FROM testView > ALTER VIEW testView AS SELECT * FROM testView2 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18389) Disallow cyclic view reference
[ https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18389. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17152 [https://github.com/apache/spark/pull/17152] > Disallow cyclic view reference > -- > > Key: SPARK-18389 > URL: https://issues.apache.org/jira/browse/SPARK-18389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 2.2.0 > > > The following should not be allowed: > {code} > CREATE VIEW testView AS SELECT id FROM jt > CREATE VIEW testView2 AS SELECT id FROM testView > ALTER VIEW testView AS SELECT * FROM testView2 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19864: Assignee: (was: Apache Spark) > add makeQualifiedPath in CatalogUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > Currently there are lots of places to make the path qualified, it is better > to provide a function to do this, then the code will be more simple. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900675#comment-15900675 ] Apache Spark commented on SPARK-19864: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/17204 > add makeQualifiedPath in CatalogUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > Currently there are lots of places to make the path qualified, it is better > to provide a function to do this, then the code will be more simple. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19864: Assignee: Apache Spark > add makeQualifiedPath in CatalogUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Apache Spark >Priority: Minor > > Currently there are lots of places to make the path qualified, it is better > to provide a function to do this, then the code will be more simple. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19843: --- Assignee: Tejas Patil > UTF8String => (int / long) conversion expensive for invalid inputs > -- > > Key: SPARK-19843 > URL: https://issues.apache.org/jira/browse/SPARK-19843 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.2.0 > > > In case of invalid inputs, converting a UTF8String to int or long returns > null. This comes at a cost wherein the method for conversion (e.g [0]) would > throw an exception. Exception handling is expensive as it will convert the > UTF8String into a java string, populate the stack trace (which is a native > call). While migrating workloads from Hive -> Spark, I see that this at an > aggregate level affects the performance of queries in comparison with hive. > The exception is just indicating that the conversion failed.. its not > propagated to users so it would be good to avoid. > Couple of options: > - Return Integer / Long (instead of primitive types) which can be set to NULL > if the conversion fails. This is boxing and super bad for perf so a big no. > - Hive has a pre-check [1] for this which is not a perfect safety net but > helpful to capture typical bad inputs eg. empty string, "null". > [0] : > https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950 > [1] : > https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19843. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17184 [https://github.com/apache/spark/pull/17184] > UTF8String => (int / long) conversion expensive for invalid inputs > -- > > Key: SPARK-19843 > URL: https://issues.apache.org/jira/browse/SPARK-19843 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil > Fix For: 2.2.0 > > > In case of invalid inputs, converting a UTF8String to int or long returns > null. This comes at a cost wherein the method for conversion (e.g [0]) would > throw an exception. Exception handling is expensive as it will convert the > UTF8String into a java string, populate the stack trace (which is a native > call). While migrating workloads from Hive -> Spark, I see that this at an > aggregate level affects the performance of queries in comparison with hive. > The exception is just indicating that the conversion failed.. its not > propagated to users so it would be good to avoid. > Couple of options: > - Return Integer / Long (instead of primitive types) which can be set to NULL > if the conversion fails. This is boxing and super bad for perf so a big no. > - Hive has a pre-check [1] for this which is not a perfect safety net but > helpful to capture typical bad inputs eg. empty string, "null". > [0] : > https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950 > [1] : > https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code
Song Jun created SPARK-19864: Summary: add makeQualifiedPath in CatalogUtils to optimize some code Key: SPARK-19864 URL: https://issues.apache.org/jira/browse/SPARK-19864 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Currently there are lots of places to make the path qualified, it is better to provide a function to do this, then the code will be more simple. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
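A sketch of the kind of helper the ticket asks for, using Hadoop's {{FileSystem.makeQualified}} (this is an assumption about the shape and name, not the merged implementation):
{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Qualify a possibly relative or scheme-less path against its FileSystem so
// callers do not repeat the same boilerplate everywhere.
def makeQualifiedPath(path: String, hadoopConf: Configuration): URI = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(hadoopConf)
  fs.makeQualified(hadoopPath).toUri
}
{code}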
[jira] [Assigned] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19863: Assignee: Apache Spark > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming application > > > Key: SPARK-19863 > URL: https://issues.apache.org/jira/browse/SPARK-19863 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.1.0 >Reporter: LvDongrong >Assignee: Apache Spark > > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming > application. In Spark 2.x, the kafka consumer was replaced by > CachedKafkaConsumer (some KafkaConsumer will keep establishing the kafka > cluster), and cannot change the way. In fact ,The KafkaRDD(used by > DirectKafkaInputDStream to connect kafka) provide the parameter > useConsumerCache to choose Whether to use the CachedKafkaConsumer, but the > DirectKafkaInputDStream set the parameter true. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19863: Assignee: (was: Apache Spark) > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming application > > > Key: SPARK-19863 > URL: https://issues.apache.org/jira/browse/SPARK-19863 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.1.0 >Reporter: LvDongrong > > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming > application. In Spark 2.x, the kafka consumer was replaced by > CachedKafkaConsumer (some KafkaConsumer will keep establishing the kafka > cluster), and cannot change the way. In fact ,The KafkaRDD(used by > DirectKafkaInputDStream to connect kafka) provide the parameter > useConsumerCache to choose Whether to use the CachedKafkaConsumer, but the > DirectKafkaInputDStream set the parameter true. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900669#comment-15900669 ] Apache Spark commented on SPARK-19863: -- User 'lvdongr' has created a pull request for this issue: https://github.com/apache/spark/pull/17203 > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming application > > > Key: SPARK-19863 > URL: https://issues.apache.org/jira/browse/SPARK-19863 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.1.0 >Reporter: LvDongrong > > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming > application. In Spark 2.x, the kafka consumer was replaced by > CachedKafkaConsumer (some KafkaConsumer will keep establishing the kafka > cluster), and cannot change the way. In fact ,The KafkaRDD(used by > DirectKafkaInputDStream to connect kafka) provide the parameter > useConsumerCache to choose Whether to use the CachedKafkaConsumer, but the > DirectKafkaInputDStream set the parameter true. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
LvDongrong created SPARK-19863: -- Summary: Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application Key: SPARK-19863 URL: https://issues.apache.org/jira/browse/SPARK-19863 Project: Spark Issue Type: Bug Components: DStreams, Input/Output Affects Versions: 2.1.0 Reporter: LvDongrong Whether or not to use the CachedKafkaConsumer needs to be configurable when you use DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application. In Spark 2.x the Kafka consumer was replaced by CachedKafkaConsumer (cached consumers keep connections to the Kafka cluster alive), and there is no way to change this. In fact, the KafkaRDD (used by DirectKafkaInputDStream to connect to Kafka) provides the parameter useConsumerCache to choose whether to use the CachedKafkaConsumer, but DirectKafkaInputDStream sets the parameter to true unconditionally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle
[ https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900662#comment-15900662 ] Joseph K. Bradley commented on SPARK-13969: --- Noticing this JIRA again. I feel like this is partly solved: * HashingTF can handle pretty much arbitrary types. * HashingTF should have identical behavior across languages and JVM versions. (These first 2 are because HashingTF converts the input to StringType and then hashes it using MurmurHash3.) with 1 remaining item: * HashingTF does not take multiple input columns. That would effectively make it handle VectorAssembler's job as well. What are your thoughts here? Should we just stick with VectorAssembler? > Extend input format that feature hashing can handle > --- > > Key: SPARK-13969 > URL: https://issues.apache.org/jira/browse/SPARK-13969 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > > Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in > scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of > strings and computes term frequencies. > The use cases for feature hashing extend to arbitrary feature values (binary, > count or real-valued). For example, scikit-learn's {{FeatureHasher}} can > accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this > way, feature hashing can operate as both "one-hot encoder" and "vector > assembler" at the same time. > Investigate adding a more generic feature hasher (that in turn can be used by > {{HashingTF}}). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19862) In SparkEnv.scala, the shortShuffleMgrNames entry for tungsten-sort can be deleted.
guoxiaolong created SPARK-19862: --- Summary: In SparkEnv.scala, the shortShuffleMgrNames entry for tungsten-sort can be deleted. Key: SPARK-19862 URL: https://issues.apache.org/jira/browse/SPARK-19862 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: guoxiaolong Priority: Minor The entry "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be deleted, because it is the same as "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
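For context, the mapping the ticket refers to looks approximately like the snippet below (paraphrased from SparkEnv.create in Spark 2.x, not quoted verbatim; SortShuffleManager is private[spark], so this only compiles inside Spark itself). Both short names resolve to the same class, which is the redundancy being pointed out, although dropping the "tungsten-sort" alias would break jobs that still set spark.shuffle.manager=tungsten-sort.
{code}
import org.apache.spark.SparkConf

// Approximate shape of the lookup in SparkEnv.create: two aliases, one class.
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)

val conf = new SparkConf()
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
{code}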
[jira] [Commented] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900631#comment-15900631 ] Apache Spark commented on SPARK-19861: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17202 > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19861: Assignee: (was: Apache Spark) > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19861: Assignee: Apache Spark > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19861) watermark should not be a negative time.
Genmao Yu created SPARK-19861: - Summary: watermark should not be a negative time. Key: SPARK-19861 URL: https://issues.apache.org/jira/browse/SPARK-19861 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900608#comment-15900608 ] Takeshi Yamamuro commented on SPARK-18359: -- Since JDK 9 uses CLDR locale data by default, it seems good to handle the locale explicitly: http://openjdk.java.net/jeps/252 > Let user specify locale in CSV parsing > -- > > Key: SPARK-18359 > URL: https://issues.apache.org/jira/browse/SPARK-18359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: yannick Radji > > On the DataFrameReader object there is no CSV-specific option to set the decimal > delimiter to a comma instead of a dot, as is customary in France and elsewhere in Europe. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
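As a quick illustration of why a locale option matters for CSV numbers, a small self-contained Scala sketch using java.text.NumberFormat (the values are made up):
{code}
import java.text.NumberFormat
import java.util.Locale

// The same text parses to different numbers depending on the locale:
// French treats ',' as the decimal separator, US English as a grouping separator.
val fr = NumberFormat.getInstance(Locale.FRANCE)
val us = NumberFormat.getInstance(Locale.US)

println(fr.parse("1234,56").doubleValue())  // 1234.56
println(us.parse("1234,56").doubleValue())  // 123456.0
{code}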
[jira] [Created] (SPARK-19860) DataFrame join gets a conflict error if the two frames have a column with the same name.
wuchang created SPARK-19860: --- Summary: DataFrame join get conflict error if two frames has a same name column. Key: SPARK-19860 URL: https://issues.apache.org/jira/browse/SPARK-19860 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: wuchang >>> print df1.collect() [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)] >>> print df2.collect() [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)] >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
Traceback (most recent call last): File "", line 1, in File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join jdf = self._jdf.join(other._jdf, on._jc, how) File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u" Failure when resolving conflicting references in Join: 'Join Inner, (fdate#42 = fdate#42) :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97] : +- Filter (inorout#44 = A) : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42] :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) : +- SubqueryAlias history_transfer_v : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51] : +- SubqueryAlias history_transfer :+- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int)
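The warning in the log above already points at the usual workaround: alias the two frames (or join on the column name) so the join condition no longer refers to the same resolved attribute on both sides. A sketch in the Scala API, assuming df1 and df2 as in the report (the PySpark calls are analogous):
{code}
import org.apache.spark.sql.functions.col

// Aliases make the two fdate columns distinguishable in the join condition.
val joined = df1.alias("a")
  .join(df2.alias("b"), col("a.fdate") === col("b.fdate"), "inner")

// Or let Spark resolve each side itself by joining on the column name:
// val joined = df1.join(df2, Seq("fdate"), "inner")
{code}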
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18055: Assignee: Michael Armbrust (was: Apache Spark) > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900568#comment-15900568 ] Apache Spark commented on SPARK-18055: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/17201 > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18055: Assignee: Apache Spark (was: Michael Armbrust) > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Apache Spark > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18055: - Target Version/s: 2.2.0 > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900548#comment-15900548 ] Min Shen commented on SPARK-19810: -- [~srowen], Want to get an idea regarding the timeline for removing Scala 2.10. We have heavy usage of Spark at LinkedIn, and we are right now still deploying Spark built with Scala 2.10 due to various dependencies on other systems we have which still rely on Scala 2.10. While we also have plans to upgrade our various internal systems to start using Scala 2.11, it will take a while for that to happen. In the mean time, if support for Scala 2.10 is removed in Spark 2.2, this is going to potentially block us from upgrading to Spark 2.2+ while we haven't fully moved off Scala 2.10 yet. Want to raise this concern here and also to understand the timeline for removing Scala 2.10 in Spark. > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-18055: Assignee: Michael Armbrust > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)
[ https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900533#comment-15900533 ] Jim Kleckner commented on SPARK-16333: -- I ended up here when looking into why an upgrade of our streaming computation to 2.1.0 was pegging the network at a gigabit/second. Setting to spark.eventLog.enabled to false confirmed that this logging from slave port 50010 was the culprit. How can anyone with seriously large numbers of tasks use spark history with this amount of load? > Excessive Spark history event/json data size (5GB each) > --- > > Key: SPARK-16333 > URL: https://issues.apache.org/jira/browse/SPARK-16333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 > Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) > and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server > release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build) >Reporter: Peter Liu > Labels: performance, spark2.0.0 > > With Spark2.0.0-preview (May-24 build), the history event data (the json > file), that is generated for each Spark application run (see below), can be > as big as 5GB (instead of 14 MB for exactly the same application run and the > same input data of 1TB under Spark1.6.1) > -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959- > -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213- > -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856- > -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556- > The test is done with Sparkbench V2, SQL RDD (see github: > https://github.com/SparkTC/spark-bench) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
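For reference, the mitigation mentioned in the comment is a plain configuration change; a minimal sketch (both keys are standard Spark options, the choice of values is illustrative):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Turning the event log off removes the traffic entirely...
  .set("spark.eventLog.enabled", "false")
  // ...or keep it enabled and compress it to reduce the volume:
  // .set("spark.eventLog.enabled", "true")
  // .set("spark.eventLog.compress", "true")
{code}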
[jira] [Commented] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900524#comment-15900524 ] Apache Spark commented on SPARK-19561: -- User 'JasonMWhite' has created a pull request for this issue: https://github.com/apache/spark/pull/17200 > Pyspark Dataframes don't allow timestamps near epoch > > > Key: SPARK-19561 > URL: https://issues.apache.org/jira/browse/SPARK-19561 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.1.1, 2.2.0 > > > Pyspark does not allow timestamps at or near the epoch to be created in a > DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299 > TimestampType.toInternal converts a datetime object to a number representing > microseconds since the epoch. For all times more than 2148 seconds before or > after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J > automatically serializes it as a long. > However, for times within this range (~35 minutes before or after the epoch), > Py4J serializes it as an int. When creating the object on the Scala side, > ints are not recognized and the value goes to null. This leads to null values > in non-nullable fields, and corrupted Parquet files. > The solution is trivial - force TimestampType.toInternal to always return a > long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
[ https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900463#comment-15900463 ] Vincent commented on SPARK-19852: - I can work on this issue, since it is related to SPARK-17498 > StringIndexer.setHandleInvalid should have another option 'new': Python API > and docs > > > Key: SPARK-19852 > URL: https://issues.apache.org/jira/browse/SPARK-19852 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Update Python API for StringIndexer so setHandleInvalid doc is correct. This > will probably require: > * putting HandleInvalid within StringIndexer to update its built-in doc (See > Bucketizer for an example.) > * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19857. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.2.0 2.1.1 > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.1.1, 2.2.0 > > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
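The ticket deliberately leaves the bug as a puzzle; a plausible reading (not spelled out in the issue itself) is Scala's semicolon inference: because the subtraction starts on a new line, the snippet parses as two statements, so the delta is never computed. A self-contained sketch of that pitfall:
{code}
// Stand-ins for the real values; only the parsing behavior matters here.
val nextUpdateTime = System.currentTimeMillis() + 60000L

// Written across two lines, Scala infers a statement break after the first
// line, so remainingTime gets the absolute timestamp and the second line is
// a discarded unary expression.
val remainingTime = nextUpdateTime
- System.currentTimeMillis()

// Keeping the operator on the same line (or adding parentheses) preserves
// the intended single expression.
val remainingTimeFixed = nextUpdateTime - System.currentTimeMillis()
{code}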
[jira] [Commented] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900410#comment-15900410 ] Apache Spark commented on SPARK-19859: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17199 > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19859: Assignee: Shixiong Zhu (was: Apache Spark) > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19859: Assignee: Apache Spark (was: Shixiong Zhu) > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19859) The new watermark should override the old one
Shixiong Zhu created SPARK-19859: Summary: The new watermark should override the old one Key: SPARK-19859 URL: https://issues.apache.org/jira/browse/SPARK-19859 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu The new watermark should override the old one. Otherwise, we just pick up the first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900377#comment-15900377 ] Apache Spark commented on SPARK-19858: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17197 > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19858: Assignee: Shixiong Zhu (was: Apache Spark) > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19858: Assignee: Apache Spark (was: Shixiong Zhu) > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
Shixiong Zhu created SPARK-19858: Summary: Add output mode to flatMapGroupsWithState and disallow invalid cases Key: SPARK-19858 URL: https://issues.apache.org/jira/browse/SPARK-19858 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900363#comment-15900363 ] Apache Spark commented on SPARK-19857: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17198 > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19857: Assignee: (was: Apache Spark) > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19857: Assignee: Apache Spark > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
Marcelo Vanzin created SPARK-19857: -- Summary: CredentialUpdater calculates the wrong time for next update Key: SPARK-19857 URL: https://issues.apache.org/jira/browse/SPARK-19857 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.0 Reporter: Marcelo Vanzin This is the code: {code} val remainingTime = getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) - System.currentTimeMillis() {code} If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19855: Assignee: Reynold Xin (was: Apache Spark) > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900315#comment-15900315 ] Apache Spark commented on SPARK-19855: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17196 > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19855: Assignee: Apache Spark (was: Reynold Xin) > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests
[ https://issues.apache.org/jira/browse/SPARK-19856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-19856: Summary: Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests (was: Turn partitioning related test cases in FileSourceStrategySuite into unit tests) > Turn partitioning related test cases in FileSourceStrategySuite from > integration tests into unit tests > -- > > Key: SPARK-19856 > URL: https://issues.apache.org/jira/browse/SPARK-19856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19855) Create an internal FilePartitionStrategy interface
Reynold Xin created SPARK-19855: --- Summary: Create an internal FilePartitionStrategy interface Key: SPARK-19855 URL: https://issues.apache.org/jira/browse/SPARK-19855 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite into unit tests
Reynold Xin created SPARK-19856: --- Summary: Turn partitioning related test cases in FileSourceStrategySuite into unit tests Key: SPARK-19856 URL: https://issues.apache.org/jira/browse/SPARK-19856 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19854) Refactor file partitioning strategy to make it easier to extend / unit test
Reynold Xin created SPARK-19854: --- Summary: Refactor file partitioning strategy to make it easier to extend / unit test Key: SPARK-19854 URL: https://issues.apache.org/jira/browse/SPARK-19854 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin The way we currently do file partitioning strategy is hard coded in FileSourceScanExec. This is not ideal for two reasons: 1. It is difficult to unit test the default strategy. In order to test this, we need to do almost end-to-end tests by creating actual files on the file system. 2. It is difficult to experiment with different partitioning strategies without adding a lot of if branches. The goal of this story is to create an internal interface for this so we can make this pluggable for both better testing and experimentation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
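A hedged sketch of what such an internal interface might look like; every name and signature below is illustrative only and is not the design in the linked sub-tasks:
{code}
import org.apache.hadoop.fs.FileStatus

// Illustrative types: how a file is sliced, and how slices are grouped.
case class FileSplit(path: String, start: Long, length: Long)
case class FilePartitionSpec(index: Int, splits: Seq[FileSplit])

// Pluggable strategy: implementations can be unit-tested against in-memory
// FileStatus lists instead of real files, and swapped for experiments.
trait FilePartitionStrategy {
  def planPartitions(
      files: Seq[FileStatus],
      maxSplitBytes: Long,
      openCostInBytes: Long): Seq[FilePartitionSpec]
}

// Trivial example strategy: one partition per file, no splitting.
object OnePartitionPerFile extends FilePartitionStrategy {
  override def planPartitions(
      files: Seq[FileStatus],
      maxSplitBytes: Long,
      openCostInBytes: Long): Seq[FilePartitionSpec] =
    files.zipWithIndex.map { case (f, i) =>
      FilePartitionSpec(i, Seq(FileSplit(f.getPath.toString, 0L, f.getLen)))
    }
}
{code}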
[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18138: Labels: releasenotes (was: ) > More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 > > > Key: SPARK-18138 > URL: https://issues.apache.org/jira/browse/SPARK-18138 > Project: Spark > Issue Type: Task >Reporter: Reynold Xin >Assignee: Sean Owen >Priority: Blocker > Labels: releasenotes > Fix For: 2.1.0 > > > Plan: > - Mark it very explicit in Spark 2.1.0 that support for the aforementioned > environments are deprecated. > - Remove support it Spark 2.2.0 > Also see mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher updated SPARK-19764: --- We're driving everything from Python. It may be a bug that we're not getting the error to propagate up to the notebook - generally, we see exceptions. When we ran the same job from the PySpark shell, we saw the stacktrace, so I'm inclined to point at something in the notebook stop that made it not propagate. We're happy to investigate if you think it's useful. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
[ https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19853: - Target Version/s: 2.2.0 > Uppercase Kafka topics fail when startingOffsets are SpecificOffsets > > > Key: SPARK-19853 > URL: https://issues.apache.org/jira/browse/SPARK-19853 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Chris Bowden >Priority: Trivial > > When using the KafkaSource with Structured Streaming, consumer assignments > are not what the user expects if startingOffsets is set to an explicit set of > topics/partitions in JSON where the topic(s) happen to have uppercase > characters. When StartingOffsets is constructed, the original string value > from options is transformed toLowerCase to make matching on "earliest" and > "latest" case insensitive. However, the toLowerCase json is passed to > SpecificOffsets for the terminal condition, so topic names may not be what > the user intended by the time assignments are made with the underlying > KafkaConsumer. > From KafkaSourceProvider: > {code} > val startingOffsets = > caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) > match { > case Some("latest") => LatestOffsets > case Some("earliest") => EarliestOffsets > case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) > case None => LatestOffsets > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
[ https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bowden updated SPARK-19853: - Description: When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. >From KafkaSourceProvider: {code} val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match { case Some("latest") => LatestOffsets case Some("earliest") => EarliestOffsets case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) case None => LatestOffsets } {code} was:When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. > Uppercase Kafka topics fail when startingOffsets are SpecificOffsets > > > Key: SPARK-19853 > URL: https://issues.apache.org/jira/browse/SPARK-19853 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Chris Bowden >Priority: Trivial > > When using the KafkaSource with Structured Streaming, consumer assignments > are not what the user expects if startingOffsets is set to an explicit set of > topics/partitions in JSON where the topic(s) happen to have uppercase > characters. When StartingOffsets is constructed, the original string value > from options is transformed toLowerCase to make matching on "earliest" and > "latest" case insensitive. However, the toLowerCase json is passed to > SpecificOffsets for the terminal condition, so topic names may not be what > the user intended by the time assignments are made with the underlying > KafkaConsumer. > From KafkaSourceProvider: > {code} > val startingOffsets = > caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) > match { > case Some("latest") => LatestOffsets > case Some("earliest") => EarliestOffsets > case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) > case None => LatestOffsets > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
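One possible shape of a fix, sketched against the snippet quoted above (an illustration, not necessarily the patch that landed): compare only the "latest"/"earliest" keywords case-insensitively and hand the untouched JSON to SpecificOffsets so topic casing is preserved.
{code}
// Keep the user's original casing for the JSON branch.
val rawStartingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim)

val startingOffsets = rawStartingOffsets match {
  case Some(s) if s.equalsIgnoreCase("latest") => LatestOffsets
  case Some(s) if s.equalsIgnoreCase("earliest") => EarliestOffsets
  case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
  case None => LatestOffsets
}
{code}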
[jira] [Created] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
Chris Bowden created SPARK-19853: Summary: Uppercase Kafka topics fail when startingOffsets are SpecificOffsets Key: SPARK-19853 URL: https://issues.apache.org/jira/browse/SPARK-19853 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Chris Bowden Priority: Trivial When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] Chris Rogers edited comment on SPARK-16207 at 3/7/17 9:52 PM: -- [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. I'm new to Scala so not sure if this is practical, but maybe the appropriate methods could be moved to an `RDDPreservesOrdering` class with an implicit conversion, akin to `PairRDDFunctions`? was (Author: rcrogers): [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
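A small sketch of the enrichment-class idea floated at the end of the comment above (all names are hypothetical; this is not an existing Spark API): order-aware variants of methods could live behind an implicit conversion, in the same spirit as PairRDDFunctions, so the ordering contract shows up in the API surface itself.
{code}
import scala.language.implicitConversions
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical enrichment class exposing explicitly order-preserving variants.
class OrderPreservingRDDFunctions[T: ClassTag](self: RDD[T]) {
  // filter keeps the relative order of surviving elements within each partition.
  def filterOrdered(f: T => Boolean): RDD[T] = self.filter(f)
}

object OrderPreservingRDDFunctions {
  implicit def rddToOrderPreserving[T: ClassTag](rdd: RDD[T]): OrderPreservingRDDFunctions[T] =
    new OrderPreservingRDDFunctions(rdd)
}
{code}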
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] Chris Rogers commented on SPARK-16207: -- [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900216#comment-15900216 ] Nick Afshartous commented on SPARK-19767: - Missed that one, thanks. > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > leads to API pages that do not have class ConsumerStrategies. The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka010{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19702) Increase refuse_seconds timeout in the Mesos Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19702: - Assignee: Michael Gummelt Priority: Minor (was: Major) Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/17031 > Increase refuse_seconds timeout in the Mesos Spark Dispatcher > -- > > Key: SPARK-19702 > URL: https://issues.apache.org/jira/browse/SPARK-19702 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt >Priority: Minor > Fix For: 2.2.0 > > > Due to the problem described here: > https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos > frameworks concurrently can result in starvation. For example, running 10 > dispatchers could result in 5 of them getting all the offers, even if they > have no jobs to launch. We must increase the refuse_seconds > timeout to solve this problem. Another option would have been to implement > suppress/revive, but that can cause starvation due to the unreliability of > Mesos RPC calls. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900207#comment-15900207 ] Sean Owen commented on SPARK-19767: --- Oh, are you not running from the {{docs/}} directory? > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > leads to API pages that do not have class ConsumerStrategies. The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka010{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900184#comment-15900184 ] Nick Afshartous commented on SPARK-19767: - Yes, I completed the steps in the Prerequisites section of https://github.com/apache/spark/blob/master/docs/README.md and got the same error about unknown tag {{include_example}} on two different computers (Linux and OSX). Looks like {{include_example}} is local in {{./docs/_plugins/include_example.rb}}, so maybe this is some kind of path issue where it's not finding the local file? > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > leads to API pages that do not have class ConsumerStrategies. The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka010{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19561. Resolution: Fixed Assignee: Jason White Fix Version/s: 2.2.0 2.1.1 > Pyspark Dataframes don't allow timestamps near epoch > > > Key: SPARK-19561 > URL: https://issues.apache.org/jira/browse/SPARK-19561 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.1.1, 2.2.0 > > > Pyspark does not allow timestamps at or near the epoch to be created in a > DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299 > TimestampType.toInternal converts a datetime object to a number representing > microseconds since the epoch. For all times more than 2148 seconds before or > after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J > automatically serializes it as a long. > However, for times within this range (~35 minutes before or after the epoch), > Py4J serializes it as an int. When creating the object on the Scala side, > ints are not recognized and the value goes to null. This leads to null values > in non-nullable fields, and corrupted Parquet files. > The solution is trivial - force TimestampType.toInternal to always return a > long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
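Editor's note: a minimal PySpark sketch of the failure mode described above, illustrative only. The datetime values are invented, timezone effects are ignored, and the actual fix lives inside TimestampType.toInternal rather than in user code.
{code}
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

near_epoch = datetime.datetime(1970, 1, 1, 0, 10, 0)      # ~600 s after the epoch
far_from_epoch = datetime.datetime(2017, 3, 7, 12, 0, 0)  # well beyond 2148 s

# Schema inference turns the Python datetimes into TimestampType values.
df = spark.createDataFrame([(near_epoch,), (far_from_epoch,)], ["ts"])

# On affected builds (2.0.1 / 2.1.0) the near-epoch value could come back as
# null, because toInternal handed Py4J an int; the fix always returns a long.
df.show(truncate=False)
{code}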
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900162#comment-15900162 ] Sean Owen commented on SPARK-16207: --- [~rcrogers] where would you document this? we could add a document in a place that would have helped you. However as I say, lots of things don't preserve order, so I don't know if it's sensible to write that everywhere. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900152#comment-15900152 ] Chris Rogers commented on SPARK-16207: -- The lack of documentation on this is immensely confusing. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900127#comment-15900127 ] Shixiong Zhu commented on SPARK-19764: -- So you don't set an UncaughtExceptionHandler and this OOM happened on the driver side? If so, then it's not a Spark bug. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19851: - Component/s: (was: Spark Core) > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Labels: flaky-test (was: ) > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite have made the > Jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Affects Version/s: (was: 2.3.0) 2.2.0 > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite have made the > Jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Component/s: Tests > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite have made the > Jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19803. Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 Thanks for fixing this [~uncleGen] and for reporting it [~sitalke...@gmail.com] > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite have made the > Jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19516) update public doc to use SparkSession instead of SparkContext
[ https://issues.apache.org/jira/browse/SPARK-19516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19516. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16856 [https://github.com/apache/spark/pull/16856] > update public doc to use SparkSession instead of SparkContext > - > > Key: SPARK-19516 > URL: https://issues.apache.org/jira/browse/SPARK-19516 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
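Editor's note: a tiny sketch of the style this doc update moves toward, with SparkSession as the single entry point and the legacy SparkContext still reachable from it. The names below are placeholders, not text taken from the pull request.
{code}
from pyspark.sql import SparkSession

# Entry point in the updated docs: build a SparkSession directly.
spark = SparkSession.builder.appName("docs-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

# The older SparkContext-centric style is still available via the session.
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
print(rdd.count())
{code}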
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1599#comment-1599 ] Ari Gesher commented on SPARK-19764: We were collecting more data than we had heap for. Still useful? > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
Joseph K. Bradley created SPARK-19852: - Summary: StringIndexer.setHandleInvalid should have another option 'new': Python API and docs Key: SPARK-19852 URL: https://issues.apache.org/jira/browse/SPARK-19852 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Priority: Minor Update Python API for StringIndexer so setHandleInvalid doc is correct. This will probably require: * putting HandleInvalid within StringIndexer to update its built-in doc (See Bucketizer for an example.) * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
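Editor's note: a short PySpark sketch of the behavior the updated docs need to describe, assuming a 2.2+ build where the 'keep' option from SPARK-17498 is exposed through the Python wrapper. The column names and data are made up.
{code}
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.master("local[2]").getOrCreate()
train = spark.createDataFrame([("a",), ("b",), ("a",)], ["label"])
test = spark.createDataFrame([("a",), ("c",)], ["label"])   # "c" was never seen

indexer = StringIndexer(inputCol="label", outputCol="labelIndex",
                        handleInvalid="keep")               # also "error", "skip"
model = indexer.fit(train)

# Under "keep", the unseen label "c" maps to a new index (max index + 1)
# instead of raising an error or dropping the row.
model.transform(test).show()
{code}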
[jira] [Resolved] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17498. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16883 [https://github.com/apache/spark/pull/16883] > StringIndexer.setHandleInvalid should have another option 'new' > --- > > Key: SPARK-17498 > URL: https://issues.apache.org/jira/browse/SPARK-17498 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Miroslav Balaz >Assignee: Vincent >Priority: Minor > Fix For: 2.2.0 > > > That will map an unseen label to the maximum known label + 1; IndexToString would map > that back to "" or NA, if there is something like that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-19851: --- Description: Add support for EVERY and ANY (SOME) aggregates. - EVERY returns true if all input values are true. - ANY returns true if at least one input value is true. - SOME is equivalent to ANY. Both aggregates are part of the SQL standard. > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
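Editor's note: until native EVERY/ANY land, the semantics can be emulated with existing aggregates. A hedged PySpark workaround sketch over invented data, not the proposed implementation:
{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, True), (1, False), (2, True)], ["k", "v"])

# Cast the boolean to 0/1 and reduce with min/max:
#   EVERY(v)      <=> min(cast(v as int)) == 1
#   ANY/SOME(v)   <=> max(cast(v as int)) == 1
df.groupBy("k").agg(
    (F.min(F.col("v").cast("int")) == 1).alias("every_v"),
    (F.max(F.col("v").cast("int")) == 1).alias("any_v"),
).show()
{code}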
[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899988#comment-15899988 ] Michael Styles commented on SPARK-19851: https://github.com/apache/spark/pull/17194 > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi-threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899979#comment-15899979 ] Apache Spark commented on SPARK-19348: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/17193 > pyspark.ml.Pipeline gets corrupted under multi-threaded use > --- > > Key: SPARK-19348 > URL: https://issues.apache.org/jira/browse/SPARK-19348 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Vinayak Joshi >Assignee: Bryan Cutler > Fix For: 2.2.0 > > Attachments: pyspark_pipeline_threads.py > > > When pyspark.ml.Pipeline objects are constructed concurrently in separate > Python threads, it is observed that the stages used to construct a pipeline > object get corrupted, i.e., the stages supplied to a Pipeline object in one > thread appear inside a different Pipeline object constructed in a different > thread. > Things work fine if construction of pyspark.ml.Pipeline objects is > serialized, so this looks like a thread safety problem with > pyspark.ml.Pipeline object construction. > Confirmed that the problem exists with Spark 1.6.x as well as 2.x. > While the corruption of the Pipeline stages is easily caught, we need to know > whether other pipeline operations, such as pyspark.ml.pipeline.fit(), > are also affected by the underlying cause of this problem. That is, whether > other pipeline operations like pyspark.ml.pipeline.fit() may be performed > in separate threads (on distinct pipeline objects) concurrently without any > cross-contamination between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
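Editor's note: the attached pyspark_pipeline_threads.py is the authoritative reproduction; the sketch below is only an assumed shape of such a test, building Pipelines from several threads and checking that each kept the stages it was given.
{code}
import threading
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.master("local[2]").getOrCreate()

results = {}

def build(i):
    # Each thread builds its own stage list and its own Pipeline.
    stages = [StringIndexer(inputCol="col_%d" % i, outputCol="idx_%d" % i)]
    results[i] = (stages, Pipeline(stages=stages).getStages())

threads = [threading.Thread(target=build, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# On affected versions a Pipeline could end up holding stages supplied by a
# different thread; here every entry should still hold its own StringIndexer.
for i, (supplied, held) in results.items():
    assert held == supplied, "Pipeline %d picked up foreign stages" % i
{code}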
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899980#comment-15899980 ] Shixiong Zhu commented on SPARK-19764: -- [~agesher] Do you have the OOM stack trace? So that we can fix it. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
Michael Styles created SPARK-19851: -- Summary: Add support for EVERY and ANY (SOME) aggregates Key: SPARK-19851 URL: https://issues.apache.org/jira/browse/SPARK-19851 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL Affects Versions: 2.1.0 Reporter: Michael Styles -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher resolved SPARK-19764. Resolution: Not A Bug > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899964#comment-15899964 ] Ari Gesher commented on SPARK-19764: We narrowed this down to driver OOM that wasn't being properly propagated into our Jupyter Notebook. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18549. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.2.0 > Failed to Uncache a View that References a Dropped Table. > - > > Key: SPARK-18549 > URL: https://issues.apache.org/jira/browse/SPARK-18549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Xiao Li >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.0 > > > {code} > spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1") > spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2") > sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2") > // Cache is empty at the beginning > assert(spark.sharedState.cacheManager.isEmpty) > sql("CACHE TABLE testView") > assert(spark.catalog.isCached("testView")) > // Cache is not empty > assert(!spark.sharedState.cacheManager.isEmpty) > {code} > {code} > // drop a table referenced by a cached view > sql("DROP TABLE jt1") > -- So far everything is fine > // Failed to unache the view > val e = intercept[AnalysisException] { > sql("UNCACHE TABLE testView") > }.getMessage > assert(e.contains("Table or view not found: `default`.`jt1`")) > // We are unable to drop it from the cache > assert(!spark.sharedState.cacheManager.isEmpty) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19765. - Resolution: Fixed Fix Version/s: 2.2.0 > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > Fix For: 2.2.0 > > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
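Editor's note: a small PySpark sketch of the intended behavior after this fix, assuming a 2.2.0+ build; a temp view stands in for a real table and the name is invented. With the change, un-caching the table also drops cached plans derived from it, not just the table's own cache entry.
{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.range(10).createOrReplaceTempView("t")

spark.sql("CACHE TABLE t")
derived = spark.table("t").filter("id > 5").cache()
derived.count()                     # materializes a cached plan built on top of t

spark.sql("UNCACHE TABLE t")
print(spark.catalog.isCached("t"))  # False; the derived cached plan is dropped as well
{code}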