[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214673#comment-16214673 ] Weichen Xu commented on SPARK-21866: [~josephkb] The datasource API has the advantage of exploiting the SQL optimizer (filter push-down and column pruning), e.g.: {code} spark.read.image(...).filter("image.width > 100").cache() {code} The datasource API allows us to do some optimization to avoid scanning images for which image.width <= 100 (i.e. we can get the filter information through the datasource reader interface). But do we really need such an optimization? > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. 
Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Target users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be
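The OpenCV naming convention quoted above ("CV_8UC3" means 3-channel unsigned 8-bit) is mechanical enough to parse. Below is a small illustrative sketch; the function name and the returned fields are my own, not part of the proposed API:

```python
import re

def parse_opencv_mode(mode):
    """Split an OpenCV type string like "CV_8UC3" into bit depth,
    element kind, and channel count (illustrative, not the proposed API)."""
    m = re.fullmatch(r"CV_(\d+)([USF])C(\d+)", mode)
    if m is None:
        raise ValueError("not an OpenCV mode string: %r" % mode)
    bits, kind, channels = m.groups()
    kinds = {"U": "unsigned int", "S": "signed int", "F": "float"}
    return {"bits": int(bits), "kind": kinds[kind], "channels": int(channels)}
```

For example, "CV_8UC3" parses to 8-bit unsigned with 3 channels, matching the BGR layout discussed in the SPIP.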
[jira] [Commented] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not
[ https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214667#comment-16214667 ] yuhao yang commented on SPARK-22331: cc [~WeichenXu123] > Strength consistency for supporting string params: case-insensitive or not > -- > > Key: SPARK-22331 > URL: https://issues.apache.org/jira/browse/SPARK-22331 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > Some String params in ML are still case-sensitive, as they are checked by > ParamValidators.inArray. > For consistency in user experience, there should be some general guideline on > whether String params in Spark MLlib are case-insensitive or not. > I'm leaning towards making all String params case-insensitive where possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not
yuhao yang created SPARK-22331: -- Summary: Strength consistency for supporting string params: case-insensitive or not Key: SPARK-22331 URL: https://issues.apache.org/jira/browse/SPARK-22331 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor Some String params in ML are still case-sensitive, as they are checked by ParamValidators.inArray. For consistency in user experience, there should be some general guideline on whether String params in Spark MLlib are case-insensitive or not. I'm leaning towards making all String params case-insensitive where possible.
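To make the suggestion concrete: a case-insensitive counterpart to ParamValidators.inArray could normalize case before comparing. A minimal Python sketch (hypothetical; Spark's actual validators live in Scala and this is not their API):

```python
def in_array_case_insensitive(allowed):
    """Build a validator that accepts a string iff it matches one of
    `allowed`, ignoring case (a sketch of the suggested behavior)."""
    lowered = {a.lower() for a in allowed}
    def validate(value):
        return isinstance(value, str) and value.lower() in lowered
    return validate

# Example: an impurity param that tolerates any casing.
is_valid_impurity = in_array_case_insensitive(["gini", "entropy"])
```

With this, "Gini", "gini", and "ENTROPY" would all validate, while unknown values are still rejected.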
[jira] [Commented] (SPARK-22281) Handle R method breaking signature changes
[ https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214663#comment-16214663 ] Shivaram Venkataraman commented on SPARK-22281: --- Thanks for looking into this [~felixcheung] Is there any way we can remove the `usage` entry as well from the Rdoc? This might also be something to raise with the roxygen project for a longer-term solution > Handle R method breaking signature changes > -- > > Key: SPARK-22281 > URL: https://issues.apache.org/jira/browse/SPARK-22281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Felix Cheung > > As discussed here > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555 > this WARNING on R-devel > * checking for code/documentation mismatches ... WARNING > Codoc mismatches from documentation object 'attach': > attach > Code: function(what, pos = 2L, name = deparse(substitute(what), > backtick = FALSE), warn.conflicts = TRUE) > Docs: function(what, pos = 2L, name = deparse(substitute(what)), > warn.conflicts = TRUE) > Mismatches in argument default values: > Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: > deparse(substitute(what)) > Checked the latest release R 3.4.1 and the signature change wasn't there. > This likely indicated an upcoming change in the next R release that could > incur this new warning when we attempt to publish the package. > Not sure what we can do now since we work with multiple versions of R and > they will have different signatures then.
[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214601#comment-16214601 ] Nathan Kronenfeld commented on SPARK-22308: --- ok, the documentation is taken out... I'll make a new issue for that, but I want to think about the wording, so I'll do it over the next couple of days. > Support unit tests of spark code using ScalaTest using suites other than > FunSuite > - > > Key: SPARK-22308 > URL: https://issues.apache.org/jira/browse/SPARK-22308 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core, SQL, Tests >Affects Versions: 2.2.0 >Reporter: Nathan Kronenfeld >Priority: Minor > Labels: scalatest, test-suite, test_issue > > External codebases that have Spark code can test it using SharedSparkContext, > no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, > FlatSpec, or WordSpec. > SharedSQLContext only supports FunSuite.
[jira] [Closed] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug
[ https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao closed SPARK-22272. --- > killing task may cause the executor progress hang because of the JVM bug > > > Key: SPARK-22272 > URL: https://issues.apache.org/jira/browse/SPARK-22272 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.2 > Environment: java version "1.7.0_75" > hadoop version 2.5.0 >Reporter: roncenzhao > Attachments: 26883.jstack, screenshot-1.png, screenshot-2.png > > > JVM bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8132693 > We kill the task using 'Thread.interrupt()', and the ShuffleMapTask uses NIO to > merge all partition files when 'spark.file.transferTo' is true (the default), so > it may trigger the JVM bug. > When the driver sends a task to this bad executor, the task will never run, > and as a result the job will hang forever with no way to recover.
[jira] [Resolved] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug
[ https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-22272. - Resolution: Won't Fix > killing task may cause the executor progress hang because of the JVM bug
[jira] [Commented] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug
[ https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214569#comment-16214569 ] Saisai Shao commented on SPARK-22272: - Setting "spark.file.transferTo" to "false" will potentially affect performance; that's why it is enabled by default, with an undocumented configuration left for users who hit issues. The original JIRA is SPARK-3948, which is a kernel issue. I don't think we should disable this by default, since it is JDK/kernel specific. If you encounter such problems, you can disable it in your cluster, but in general we don't want to hurt everyone's performance by disabling it. > killing task may cause the executor progress hang because of the JVM bug
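For clusters that do hit the JVM bug, the per-application workaround Saisai describes can be sketched as follows. This is a hedged PySpark example: the flag is the undocumented one discussed in this thread, the application name is a placeholder, and disabling it trades shuffle-merge performance for robustness.

```python
# Hypothetical PySpark session that opts out of NIO transferTo file merging.
# Only "spark.file.transferTo" comes from this thread; the rest is illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("transferTo-workaround")          # placeholder name
         .config("spark.file.transferTo", "false")  # fall back to buffered copies
         .getOrCreate())
```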
[jira] [Commented] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[ https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214552#comment-16214552 ] Apache Spark commented on SPARK-22319: -- User 'sjrand' has created a pull request for this issue: https://github.com/apache/spark/pull/19554 > SparkSubmit calls getFileStatus before calling loginUserFromKeytab > -- > > Key: SPARK-22319 > URL: https://issues.apache.org/jira/browse/SPARK-22319 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Steven Rand >Assignee: Steven Rand > Fix For: 2.3.0 > > > In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls > {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346. > We do this before we call {{loginUserFromKeytab}}, which is further down in > the same method: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655. > The result is that the call to {{resolveGlobPaths}} fails in secure clusters > with: > {code} > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > {code} > A workaround is to {{kinit}} on the host before using spark-submit. However, > it's better if this workaround isn't necessary. A simple fix is to call > loginUserFromKeytab before attempting to interact with HDFS. > At least for cluster mode, this would appear to be a regression caused by > SPARK-21012.
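The fix described above is purely an ordering change: authenticate first, then touch the file system. A minimal Python sketch with stand-in functions (none of these names are Spark's actual internals; in Spark the real calls are UserGroupInformation.loginUserFromKeytab and a NameNode getFileStatus RPC):

```python
# Record the order of the two steps to show the intended sequence.
call_order = []

def login_user_from_keytab(principal, keytab):
    call_order.append("login")      # stand-in for the Kerberos login

def resolve_glob_paths(pattern):
    call_order.append("resolve")    # stand-in; would fail without credentials
    return [pattern]

def prepare_submit(principal, keytab, pattern):
    # The fix: log in from the keytab *before* resolving glob paths on HDFS.
    login_user_from_keytab(principal, keytab)
    return resolve_glob_paths(pattern)

files = prepare_submit("alice@EXAMPLE.COM", "alice.keytab", "hdfs:///data/*.parquet")
```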
[jira] [Assigned] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[ https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-22319: --- Assignee: Steven Rand > SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[jira] [Resolved] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[ https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-22319. - Resolution: Fixed Fix Version/s: 2.3.0 > SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[jira] [Updated] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[ https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-22319: Affects Version/s: 2.2.0 > SparkSubmit calls getFileStatus before calling loginUserFromKeytab
[jira] [Commented] (SPARK-22330) Linear containsKey operation for serialized maps.
[ https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214472#comment-16214472 ] Apache Spark commented on SPARK-22330: -- User 'Whoosh' has created a pull request for this issue: https://github.com/apache/spark/pull/19553 > Linear containsKey operation for serialized maps. > - > > Key: SPARK-22330 > URL: https://issues.apache.org/jira/browse/SPARK-22330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 2.2.0 >Reporter: Alexander > Labels: performance > Original Estimate: 5m > Remaining Estimate: 5m > > One of our production applications, which aggressively uses cached Spark RDDs, > degraded after data volumes increased, though it shouldn't have. A quick > profiling session showed that the slowest part was SerializableMapWrapper#containsKey: > it delegates get and remove to the actual implementation, but containsKey is > inherited from AbstractMap, which implements it in linear time by iterating > over the whole keySet. The workaround was simple: replacing every containsKey with > get(key) != null solved the issue. > Nevertheless, it would be much simpler for everyone if the issue were > fixed once and for all. > The fix is straightforward: delegate containsKey to the actual implementation.
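The performance trap and the fix translate directly into a small model. Here is a Python sketch (class names are illustrative; the real SerializableMapWrapper is Java/Scala code in Spark):

```python
class SlowMapWrapper:
    """Mimics AbstractMap's inherited containsKey: a linear scan over keys."""
    def __init__(self, underlying):
        self._m = underlying
    def __contains__(self, key):
        return any(k == key for k in self._m.keys())  # O(n) iteration

class FixedMapWrapper(SlowMapWrapper):
    """The proposed fix: delegate membership to the wrapped map itself."""
    def __contains__(self, key):
        return key in self._m                         # O(1) for a hash map

slow = SlowMapWrapper({"a": 1, "b": 2})
fixed = FixedMapWrapper({"a": 1, "b": 2})
```

Both wrappers give the same answers; only the asymptotic cost of the membership test differs, which is exactly why the slowdown only appeared as data volumes grew.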
[jira] [Assigned] (SPARK-22330) Linear containsKey operation for serialized maps.
[ https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22330: Assignee: (was: Apache Spark) > Linear containsKey operation for serialized maps.
[jira] [Assigned] (SPARK-22330) Linear containsKey operation for serialized maps.
[ https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22330: Assignee: Apache Spark > Linear containsKey operation for serialized maps.
[jira] [Created] (SPARK-22330) Linear containsKey operation for serialized maps.
Alexander created SPARK-22330: - Summary: Linear containsKey operation for serialized maps. Key: SPARK-22330 URL: https://issues.apache.org/jira/browse/SPARK-22330 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0, 1.2.1 Reporter: Alexander One of our production applications, which aggressively uses cached Spark RDDs, degraded after data volumes increased, though it shouldn't have. A quick profiling session showed that the slowest part was SerializableMapWrapper#containsKey: it delegates get and remove to the actual implementation, but containsKey is inherited from AbstractMap, which implements it in linear time by iterating over the whole keySet. The workaround was simple: replacing every containsKey with get(key) != null solved the issue. Nevertheless, it would be much simpler for everyone if the issue were fixed once and for all. The fix is straightforward: delegate containsKey to the actual implementation.
[jira] [Commented] (SPARK-22281) Handle R method breaking signature changes
[ https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214431#comment-16214431 ] Felix Cheung commented on SPARK-22281: -- Tried a few things. If we remove the {code} @param {code} entries, then CRAN checks fail with {code} * checking Rd \usage sections ... WARNING Undocumented arguments in documentation object 'attach' ‘what’ ‘pos’ ‘name’ ‘warn.conflicts’ Functions with \usage entries need to have the appropriate \alias entries, and all their arguments documented. The \usage entries must correspond to syntactically valid R code. See chapter ‘Writing R documentation files’ in the ‘Writing R Extensions’ manual. {code} If we change the method signature to {code} setMethod("attach", signature(what = "SparkDataFrame"), function(what, ...) { {code} then it fails to install with {code} Error in rematchDefinition(definition, fdef, mnames, fnames, signature) : methods can add arguments to the generic ‘attach’ only if '...' is an argument to the generic Error : unable to load R code in package ‘SparkR’ ERROR: lazy loading failed for package ‘SparkR’ {code} > Handle R method breaking signature changes
[jira] [Assigned] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
[ https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22329: Assignee: Apache Spark > Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default > -- > > Key: SPARK-22329 > URL: https://issues.apache.org/jira/browse/SPARK-22329 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Critical > > In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical > issue. > - SPARK-19611 uses `INFER_AND_SAVE` at 2.2.0 since Spark 2.1.0 breaks some > Hive tables backed by case-sensitive data files. > bq. This situation will occur for any Hive table that wasn't created by Spark > or that was created prior to Spark 2.1.0. If a user attempts to run a query > over such a table containing a case-sensitive field name in the query > projection or in the query filter, the query will return 0 results in every > case. > - However, SPARK-22306 reports this also corrupts the Hive Metastore schema by > removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing the owner. > - Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS look > okay at least. However, we need to figure out the issue of changing owners. > Also, we cannot backport the bucketing patch into `branch-2.2`. We need more > tests before releasing 2.3.0. > The Hive Metastore is a shared resource and Spark should not corrupt it by > default. This issue proposes to set that option back to `NEVER_INFER` by > default. Users can take the risk of enabling `INFER_AND_SAVE` by > themselves.
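If NEVER_INFER becomes the default as proposed, users who rely on schema inference would opt back in explicitly per application. A hedged PySpark sketch (only the config key and its mode names come from Spark; the session setup is illustrative):

```python
# Hypothetical opt-in to schema inference if the default flips to NEVER_INFER.
# Valid modes for this key are NEVER_INFER, INFER_ONLY, and INFER_AND_SAVE.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
         .enableHiveSupport()
         .getOrCreate())
```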
[jira] [Assigned] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
[ https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22329: Assignee: (was: Apache Spark) > Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
[jira] [Commented] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
[ https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214410#comment-16214410 ]

Apache Spark commented on SPARK-22329:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19552
[jira] [Created] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
Dongjoon Hyun created SPARK-22329:
-------------------------------------

             Summary: Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
                 Key: SPARK-22329
                 URL: https://issues.apache.org/jira/browse/SPARK-22329
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Dongjoon Hyun
            Priority: Critical
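For readers affected on 2.2.0 before any default changes land, the inference mode can be pinned explicitly rather than relying on the release default. A minimal sketch: the config key `spark.sql.hive.caseSensitiveInferenceMode` and its values are real, while the session setup around it (app name, etc.) is illustrative only.

```python
# Sketch: pin the Hive schema inference mode explicitly so behavior does not
# change with the version default. Valid values include NEVER_INFER,
# INFER_AND_SAVE, and INFER_ONLY.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("inference-mode-example")  # hypothetical app name
    .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
    .enableHiveSupport()
    .getOrCreate()
)
```

Setting the option per-application this way also avoids surprises if the cluster-wide default differs between environments.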
[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214332#comment-16214332 ]

Kazuaki Ishizaki commented on SPARK-22284:
------------------------------------------

Yes, there was a similar ticket that has been solved. Your case is more complicated because it includes a `struct`. I created a repro in my environment, and I think it is solvable; I will create a PR this week. To fix a problem, we usually submit a PR against the master branch. Then, committers consider whether the PR is applicable to previous versions based on a risk assessment.

> Code of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22284
>                 URL: https://issues.apache.org/jira/browse/SPARK-22284
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Ben
>         Attachments: 64KB Error.log
>
> I am using PySpark 2.1.0 in a production environment, and I am trying to join two DataFrames, one of which is very large and has complex nested structures. Basically, I load both DataFrames and cache them. Then, in the large DataFrame, I extract 3 nested values and save them as direct columns. Finally, I join on these three columns with the smaller DataFrame.
> This would be a short version of the code:
> {code}
> dataFrame = spark.read. ... .cache()
> dataFrameSmall = spark.read. ... .cache()
> dataFrame = dataFrame.selectExpr(['*', 'nested.Value1 AS Value1', 'nested.Value2 AS Value2', 'nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, ['Value1', 'Value2', 'Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it reaches the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 (TID 11234, somehost.com, executor 10): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. Most of the fixes target versions up to Spark 2.1.0, so I don't know whether running on Spark 2.2.0 would fix it. In any case, I cannot change the Spark version since it is in production.
> I have also tried setting
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
> but I still get the same error. The job worked well up to now, also with large datasets, but apparently this batch got larger, and that is the only thing that changed. Is there any workaround for this?
[jira] [Updated] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz updated SPARK-2620:
--------------------------------------

    Affects Version/s: 2.2.0

> case class cannot be used as key for reduce
> -------------------------------------------
>
>                 Key: SPARK-2620
>                 URL: https://issues.apache.org/jira/browse/SPARK-2620
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
>         Environment: reproduced on spark-shell local[4]
>            Reporter: Gerard Maas
>            Assignee: Tobias Schlatter
>            Priority: Critical
>              Labels: case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example:
> {code}
> case class P(name: String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
> // [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(alice),1), (P(charly),1))
> {code}
> This contrasts with the expected behavior, which should be equivalent to:
> {code}
> sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
> // Array[(String, Int)] = Array((charly,1), (alice,1), (bob,2))
> {code}
> groupByKey and distinct also present the same behavior.
[jira] [Assigned] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17902:
------------------------------------

    Assignee: (was: Apache Spark)

> collect() ignores stringsAsFactors
> ----------------------------------
>
>                 Key: SPARK-17902
>                 URL: https://issues.apache.org/jira/browse/SPARK-17902
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.1
>            Reporter: Hossein Falaki
>
> The `collect()` function signature includes an optional flag named `stringsAsFactors`. It seems to be completely ignored:
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}
[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214263#comment-16214263 ]

Apache Spark commented on SPARK-17902:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19551
[jira] [Assigned] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17902:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-22142) Move Flume support behind a profile
[ https://issues.apache.org/jira/browse/SPARK-22142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214238#comment-16214238 ]

Hyukjin Kwon commented on SPARK-22142:
--------------------------------------

Let me leave a link to the dev mailing list thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Should-Flume-integration-be-behind-a-profile-td22535.html

> Move Flume support behind a profile
> -----------------------------------
>
>                 Key: SPARK-22142
>                 URL: https://issues.apache.org/jira/browse/SPARK-22142
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.3.0
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: releasenotes
>             Fix For: 2.3.0
>
> Kafka 0.8 support was recently put behind a profile, and the YARN, Mesos, Kinesis, and Docker-related integrations are behind profiles. Flume support seems like it could be as well, making it opt-in for builds.
[jira] [Commented] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug
[ https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214220#comment-16214220 ]

roncenzhao commented on SPARK-22272:
------------------------------------

I think the simplest way is to set 'spark.file.transferTo' to false. We did this in our production environment, and the problem has never been seen again.

> killing task may cause the executor progress hang because of the JVM bug
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22272
>                 URL: https://issues.apache.org/jira/browse/SPARK-22272
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.2
>         Environment: java version "1.7.0_75", hadoop version 2.5.0
>            Reporter: roncenzhao
>         Attachments: 26883.jstack, screenshot-1.png, screenshot-2.png
>
> JVM bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8132693
> We kill the task using 'Thread.interrupt()', and ShuffleMapTask uses NIO to merge all partition files when 'spark.file.transferTo' is true (the default), so it can trigger the JVM bug. When the driver sends a task to such a bad executor, the task will never run, and as a result the job hangs forever without being handled.
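The workaround described in the comment above amounts to a single configuration change. A minimal sketch of applying it when building a session: the key `spark.file.transferTo` is a real Spark property, while the surrounding session setup (app name, etc.) is illustrative only.

```python
# Sketch: disable transferTo-based file merging during shuffle so the
# ShuffleMapTask falls back to a plain copy path, avoiding the JVM NIO bug
# referenced in the issue. The app name is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("transferTo-workaround")  # hypothetical app name
    .config("spark.file.transferTo", "false")
    .getOrCreate()
)
```

The same setting can be passed as `--conf spark.file.transferTo=false` to spark-submit; either way, note this trades the zero-copy fast path for stability.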