[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-10-22 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214673#comment-16214673
 ] 

Weichen Xu commented on SPARK-21866:


[~josephkb]
The datasource API has advantage of expoloiting SQL optimizer. (filter 
push-down & column pruning), e.g:
{code}
spark.read.image(...).filter("image.width > 100").cache()
{code}
Datasource API allow us to do some optimization to avoid scanning images which 
"image.width <=100" (i.e we can get filter information through datasource 
reader interface).
But, do we really need such optimization ?
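
For context, a rough sketch of how that filter information could be consumed through the existing {{PrunedFilteredScan}} interface. The {{ImageRelation}} class and its {{readImages}} helper are hypothetical, and the sketch assumes a flat {{width}} column rather than the nested {{image.width}} for simplicity:
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical image relation: Spark hands the parsed predicates to buildScan,
// so files whose metadata already violates "width > 100" can be skipped before decoding.
class ImageRelation(
    override val sqlContext: SQLContext,
    paths: Seq[String],
    override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val minWidth = filters.collectFirst { case GreaterThan("width", v: Int) => v }.getOrElse(0)
    readImages(paths, requiredColumns, minWidth)
  }

  // Placeholder: a real reader would list the files, drop those whose header already
  // shows width <= minWidth, decode the rest, and keep only the requested columns.
  private def readImages(paths: Seq[String], columns: Array[String], minWidth: Int): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}
{code}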

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be 

[jira] [Commented] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not

2017-10-22 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214667#comment-16214667
 ] 

yuhao yang commented on SPARK-22331:


cc [~WeichenXu123]

> Strength consistency for supporting string params: case-insensitive or not
> --
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by 
> ParamValidators.inArray.
> For a consistent user experience, there should be a general guideline on 
> whether String params in Spark MLlib are case-insensitive or not. 
> I'm leaning towards making all String params case-insensitive where possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not

2017-10-22 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22331:
--

 Summary: Strength consistency for supporting string params: 
case-insensitive or not
 Key: SPARK-22331
 URL: https://issues.apache.org/jira/browse/SPARK-22331
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


Some String params in ML are still case-sensitive, as they are checked by 
ParamValidators.inArray.

For a consistent user experience, there should be a general guideline on 
whether String params in Spark MLlib are case-insensitive or not. 

I'm leaning towards making all String params case-insensitive where possible.
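
For illustration, a minimal sketch of what a case-insensitive counterpart to {{ParamValidators.inArray}} could look like. The {{inArrayIgnoreCase}} helper and the {{HasOutputMode}} trait below are hypothetical names, not existing Spark API:
{code:scala}
import org.apache.spark.ml.param.{Param, Params}

object CaseInsensitiveValidators {
  // Hypothetical helper: accept a value if it matches any allowed entry, ignoring case.
  def inArrayIgnoreCase(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.exists(_.equalsIgnoreCase(value))
}

// Usage sketch, mirroring how ParamValidators.inArray is used in shared param traits today.
trait HasOutputMode extends Params {
  final val outputMode: Param[String] = new Param[String](this, "outputMode",
    "how results are emitted: 'append', 'overwrite', or 'ignore'",
    CaseInsensitiveValidators.inArrayIgnoreCase(Array("append", "overwrite", "ignore")))
}
{code}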





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22281) Handle R method breaking signature changes

2017-10-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214663#comment-16214663
 ] 

Shivaram Venkataraman commented on SPARK-22281:
---

Thanks for looking into this, [~felixcheung]. Is there any way we can remove the 
`usage` entry from the Rd doc as well? This might also be something to raise 
with the roxygen project for a longer-term solution.

> Handle R method breaking signature changes
> --
>
> Key: SPARK-22281
> URL: https://issues.apache.org/jira/browse/SPARK-22281
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As discussed here
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
> this WARNING appears on R-devel:
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
>   Code: function(what, pos = 2L, name = deparse(substitute(what),
>  backtick = FALSE), warn.conflicts = TRUE)
>   Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>  warn.conflicts = TRUE)
>   Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: 
> deparse(substitute(what))
> Checked the latest release, R 3.4.1, and the signature change wasn't there. 
> This likely indicates an upcoming change in the next R release that could 
> trigger this new warning when we attempt to publish the package.
> Not sure what we can do now, since we work with multiple versions of R and 
> they will have different signatures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-22 Thread Nathan Kronenfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214601#comment-16214601
 ] 

Nathan Kronenfeld commented on SPARK-22308:
---

OK, the documentation is taken out... I'll make a new issue for that, but I 
want to think about the wording, so I'll do it over the next couple of days.

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it using SharedSparkContext 
> no matter how they write their ScalaTest suites - whether based on FunSuite, 
> FunSpec, FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.
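
A hand-rolled mixin along these lines (not Spark's {{SharedSQLContext}}; the names and settings below are just an illustrative sketch) works with any suite style because it only requires {{org.scalatest.Suite}}:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Usable from FunSuite, FunSpec, FlatSpec, or WordSpec alike.
trait LocalSparkSession extends BeforeAndAfterAll { self: Suite =>
  @transient var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder().master("local[2]").appName(suiteName).getOrCreate()
  }

  override def afterAll(): Unit = {
    try {
      if (spark != null) spark.stop()
    } finally {
      super.afterAll()
    }
  }
}

// Example with a WordSpec-style suite:
// class MyJobSpec extends org.scalatest.WordSpec with LocalSparkSession { ... }
{code}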



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug

2017-10-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao closed SPARK-22272.
---

> killing task may cause the executor progress hang because of the JVM bug
> 
>
> Key: SPARK-22272
> URL: https://issues.apache.org/jira/browse/SPARK-22272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2
> Environment: java version "1.7.0_75"
> hadoop version 2.5.0
>Reporter: roncenzhao
> Attachments: 26883.jstack, screenshot-1.png, screenshot-2.png
>
>
> JVM bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8132693
> We kill the task using 'Thread.interrupt()', and the ShuffleMapTask uses NIO to 
> merge all partition files when 'spark.file.transferTo' is true (the default), so 
> it may trigger the JVM bug.
> When the driver sends a task to this bad executor, the task will never run, and 
> as a result the job will hang forever without being handled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug

2017-10-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-22272.
-
Resolution: Won't Fix

> killing task may cause the executor progress hang because of the JVM bug
> 
>
> Key: SPARK-22272
> URL: https://issues.apache.org/jira/browse/SPARK-22272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2
> Environment: java version "1.7.0_75"
> hadoop version 2.5.0
>Reporter: roncenzhao
> Attachments: 26883.jstack, screenshot-1.png, screenshot-2.png
>
>
> JVM bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8132693
> We kill the task using 'Thread.interrupt()', and the ShuffleMapTask uses NIO to 
> merge all partition files when 'spark.file.transferTo' is true (the default), so 
> it may trigger the JVM bug.
> When the driver sends a task to this bad executor, the task will never run, and 
> as a result the job will hang forever without being handled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug

2017-10-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214569#comment-16214569
 ] 

Saisai Shao commented on SPARK-22272:
-

Setting "spark.file.transferTo" to "false" will potentially affect the 
performance, that's why we enabled this by default and leave an undocumented 
configurations if users has some issues. The original JIRA is SPARK-3948, which 
is a kernel issue. I don't think we should disable this by default, since it is 
JDK/Kernel specific. If you encountered such problems, you can disable it in 
your cluster, but generally we don't want to disable this to hurt the 
performance.
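
For clusters that do hit the JDK issue, the flag can be turned off per application; a minimal sketch (the application name is arbitrary):
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object TransferToWorkaround extends App {
  // Fall back to regular copying instead of transferTo-based merging, at some cost in speed.
  // Equivalent to passing --conf spark.file.transferTo=false to spark-submit.
  val conf = new SparkConf().set("spark.file.transferTo", "false")
  val spark = SparkSession.builder().config(conf).appName("transferTo-workaround").getOrCreate()
}
{code}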

> killing task may cause the executor progress hang because of the JVM bug
> 
>
> Key: SPARK-22272
> URL: https://issues.apache.org/jira/browse/SPARK-22272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2
> Environment: java version "1.7.0_75"
> hadoop version 2.5.0
>Reporter: roncenzhao
> Attachments: 26883.jstack, screenshot-1.png, screenshot-2.png
>
>
> JVM bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8132693
> We kill the task using 'Thread.interrupt()', and the ShuffleMapTask uses NIO to 
> merge all partition files when 'spark.file.transferTo' is true (the default), so 
> it may trigger the JVM bug.
> When the driver sends a task to this bad executor, the task will never run, and 
> as a result the job will hang forever without being handled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214552#comment-16214552
 ] 

Apache Spark commented on SPARK-22319:
--

User 'sjrand' has created a pull request for this issue:
https://github.com/apache/spark/pull/19554

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Fix For: 2.3.0
>
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.
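
The ordering described above, sketched with plain Hadoop APIs (the principal, keytab, and path below are placeholders):
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object LoginBeforeHdfsAccess extends App {
  val hadoopConf = new Configuration()
  // Log in from the keytab first, so the later NameNode RPC has valid Kerberos credentials.
  UserGroupInformation.setConfiguration(hadoopConf)
  UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")
  // Only then touch HDFS; this is the kind of call that otherwise fails with the GSSException.
  val status = FileSystem.get(hadoopConf).getFileStatus(new Path("/user/someone/app.jar"))
  println(status)
}
{code}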



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-22319:
---

Assignee: Steven Rand

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Fix For: 2.3.0
>
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-22319.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Steven Rand
> Fix For: 2.3.0
>
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-22319:

Affects Version/s: 2.2.0

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Steven Rand
> Fix For: 2.3.0
>
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22330) Linear containsKey operation for serialized maps.

2017-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214472#comment-16214472
 ] 

Apache Spark commented on SPARK-22330:
--

User 'Whoosh' has created a pull request for this issue:
https://github.com/apache/spark/pull/19553

> Linear containsKey operation for serialized maps.
> -
>
> Key: SPARK-22330
> URL: https://issues.apache.org/jira/browse/SPARK-22330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 2.2.0
>Reporter: Alexander
>  Labels: performance
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> One of our production applications, which aggressively uses cached Spark RDDs, 
> degraded after data volumes increased, though it shouldn't have. A quick 
> profiling session showed that the slowest part was 
> SerializableMapWrapper#containsKey: the wrapper delegates get and remove to the 
> actual implementation, but containsKey is inherited from AbstractMap, which 
> implements it in linear time by iterating over the whole keySet. The workaround 
> was simple: replacing every containsKey with get(key) != null solved the issue.
> Nevertheless, it would be much simpler for everyone if the issue were fixed 
> once and for all.
> A fix is straightforward: delegate containsKey to the actual implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22330) Linear containsKey operation for serialized maps.

2017-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22330:


Assignee: (was: Apache Spark)

> Linear containsKey operation for serialized maps.
> -
>
> Key: SPARK-22330
> URL: https://issues.apache.org/jira/browse/SPARK-22330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 2.2.0
>Reporter: Alexander
>  Labels: performance
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> One of our production applications, which aggressively uses cached Spark RDDs, 
> degraded after data volumes increased, though it shouldn't have. A quick 
> profiling session showed that the slowest part was 
> SerializableMapWrapper#containsKey: the wrapper delegates get and remove to the 
> actual implementation, but containsKey is inherited from AbstractMap, which 
> implements it in linear time by iterating over the whole keySet. The workaround 
> was simple: replacing every containsKey with get(key) != null solved the issue.
> Nevertheless, it would be much simpler for everyone if the issue were fixed 
> once and for all.
> A fix is straightforward: delegate containsKey to the actual implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22330) Linear containsKey operation for serialized maps.

2017-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22330:


Assignee: Apache Spark

> Linear containsKey operation for serialized maps.
> -
>
> Key: SPARK-22330
> URL: https://issues.apache.org/jira/browse/SPARK-22330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 2.2.0
>Reporter: Alexander
>Assignee: Apache Spark
>  Labels: performance
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> One of our production applications, which aggressively uses cached Spark RDDs, 
> degraded after data volumes increased, though it shouldn't have. A quick 
> profiling session showed that the slowest part was 
> SerializableMapWrapper#containsKey: the wrapper delegates get and remove to the 
> actual implementation, but containsKey is inherited from AbstractMap, which 
> implements it in linear time by iterating over the whole keySet. The workaround 
> was simple: replacing every containsKey with get(key) != null solved the issue.
> Nevertheless, it would be much simpler for everyone if the issue were fixed 
> once and for all.
> A fix is straightforward: delegate containsKey to the actual implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22330) Linear containsKey operation for serialized maps.

2017-10-22 Thread Alexander (JIRA)
Alexander created SPARK-22330:
-

 Summary: Linear containsKey operation for serialized maps.
 Key: SPARK-22330
 URL: https://issues.apache.org/jira/browse/SPARK-22330
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0, 1.2.1
Reporter: Alexander


One of our production applications, which aggressively uses cached Spark RDDs, 
degraded after data volumes increased, though it shouldn't have. A quick profiling 
session showed that the slowest part was SerializableMapWrapper#containsKey: the 
wrapper delegates get and remove to the actual implementation, but containsKey is 
inherited from AbstractMap, which implements it in linear time by iterating over 
the whole keySet. The workaround was simple: replacing every containsKey with 
get(key) != null solved the issue.

Nevertheless, it would be much simpler for everyone if the issue were fixed 
once and for all.
A fix is straightforward: delegate containsKey to the actual implementation.
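
To make the mechanics concrete, here is a self-contained sketch; the {{ScalaMapWrapper}} class below is a stand-in for Spark's {{SerializableMapWrapper}}, not the actual source, and shows why the inherited {{containsKey}} is linear and how delegating it fixes that:
{code:scala}
import java.util.{AbstractMap, Map => JMap}

class ScalaMapWrapper[A, B](underlying: scala.collection.Map[A, B])
  extends AbstractMap[A, B] with Serializable {

  override def entrySet(): java.util.Set[JMap.Entry[A, B]] = {
    val set = new java.util.HashSet[JMap.Entry[A, B]]()
    underlying.foreach { case (k, v) => set.add(new AbstractMap.SimpleEntry[A, B](k, v)) }
    set
  }

  // Without this override, containsKey falls back to AbstractMap's linear scan of entrySet().
  override def containsKey(key: AnyRef): Boolean =
    underlying.contains(key.asInstanceOf[A])
}

object ContainsKeyDemo extends App {
  val wrapped = new ScalaMapWrapper(Map("a" -> "x", "b" -> "y"))
  println(wrapped.containsKey("a"))  // true, in constant time against the hash-backed Scala map
  println(wrapped.get("z") != null)  // false; the call-site workaround described above
}
{code}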



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22281) Handle R method breaking signature changes

2017-10-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214431#comment-16214431
 ] 

Felix Cheung commented on SPARK-22281:
--

Tried a few things. If we remove the 
{code}
@param
{code}

then CRAN checks fail with:
{code}
* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'attach'
  ‘what’ ‘pos’ ‘name’ ‘warn.conflicts’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
{code}

If we change the method signature to

{code}
setMethod("attach",
  signature(what = "SparkDataFrame"),
  function(what, ...) {
{code}

then it fails to install with:
{code}
Error in rematchDefinition(definition, fdef, mnames, fnames, signature) :
  methods can add arguments to the generic ‘attach’ only if '...' is an 
argument to the generic
Error : unable to load R code in package ‘SparkR’
ERROR: lazy loading failed for package ‘SparkR’
{code}


> Handle R method breaking signature changes
> --
>
> Key: SPARK-22281
> URL: https://issues.apache.org/jira/browse/SPARK-22281
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As discussed here
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
> this WARNING appears on R-devel:
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
>   Code: function(what, pos = 2L, name = deparse(substitute(what),
>  backtick = FALSE), warn.conflicts = TRUE)
>   Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>  warn.conflicts = TRUE)
>   Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: 
> deparse(substitute(what))
> Checked the latest release, R 3.4.1, and the signature change wasn't there. 
> This likely indicates an upcoming change in the next R release that could 
> trigger this new warning when we attempt to publish the package.
> Not sure what we can do now, since we work with multiple versions of R and 
> they will have different signatures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

2017-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22329:


Assignee: Apache Spark

> Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
> --
>
> Key: SPARK-22329
> URL: https://issues.apache.org/jira/browse/SPARK-22329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Critical
>
> In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical 
> issue. 
> - SPARK-19611 switched the default to `INFER_AND_SAVE` in 2.2.0 because Spark 
> 2.1.0 broke some Hive tables backed by case-sensitive data files.
> bq. This situation will occur for any Hive table that wasn't created by Spark 
> or that was created prior to Spark 2.1.0. If a user attempts to run a query 
> over such a table containing a case-sensitive field name in the query 
> projection or in the query filter, the query will return 0 results in every 
> case.
> - However, SPARK-22306 reports that this also corrupts the Hive Metastore 
> schema by removing bucketing information (BUCKETING_COLS, SORT_COLS) and 
> changing the table owner.
> - Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS look 
> okay at least. However, we need to figure out the issue of changing owners. 
> Also, we cannot backport the bucketing patch into `branch-2.2`. We need more 
> testing before releasing 2.3.0.
> Hive Metastore is a shared resource and Spark should not corrupt it by 
> default. This issue proposes to restore the default to `NEVER_INFER`, as in 
> Spark 2.1.0. Users can take the risk of enabling `INFER_AND_SAVE` themselves.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

2017-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22329:


Assignee: (was: Apache Spark)

> Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
> --
>
> Key: SPARK-22329
> URL: https://issues.apache.org/jira/browse/SPARK-22329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical 
> issue. 
> - SPARK-19611 switched the default to `INFER_AND_SAVE` in 2.2.0 because Spark 
> 2.1.0 broke some Hive tables backed by case-sensitive data files.
> bq. This situation will occur for any Hive table that wasn't created by Spark 
> or that was created prior to Spark 2.1.0. If a user attempts to run a query 
> over such a table containing a case-sensitive field name in the query 
> projection or in the query filter, the query will return 0 results in every 
> case.
> - However, SPARK-22306 reports that this also corrupts the Hive Metastore 
> schema by removing bucketing information (BUCKETING_COLS, SORT_COLS) and 
> changing the table owner.
> - Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS look 
> okay at least. However, we need to figure out the issue of changing owners. 
> Also, we cannot backport the bucketing patch into `branch-2.2`. We need more 
> testing before releasing 2.3.0.
> Hive Metastore is a shared resource and Spark should not corrupt it by 
> default. This issue proposes to restore the default to `NEVER_INFER`, as in 
> Spark 2.1.0. Users can take the risk of enabling `INFER_AND_SAVE` themselves.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

2017-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214410#comment-16214410
 ] 

Apache Spark commented on SPARK-22329:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19552

> Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
> --
>
> Key: SPARK-22329
> URL: https://issues.apache.org/jira/browse/SPARK-22329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical 
> issue. 
> - SPARK-19611 switched the default to `INFER_AND_SAVE` in 2.2.0 because Spark 
> 2.1.0 broke some Hive tables backed by case-sensitive data files.
> bq. This situation will occur for any Hive table that wasn't created by Spark 
> or that was created prior to Spark 2.1.0. If a user attempts to run a query 
> over such a table containing a case-sensitive field name in the query 
> projection or in the query filter, the query will return 0 results in every 
> case.
> - However, SPARK-22306 reports that this also corrupts the Hive Metastore 
> schema by removing bucketing information (BUCKETING_COLS, SORT_COLS) and 
> changing the table owner.
> - Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS look 
> okay at least. However, we need to figure out the issue of changing owners. 
> Also, we cannot backport the bucketing patch into `branch-2.2`. We need more 
> testing before releasing 2.3.0.
> Hive Metastore is a shared resource and Spark should not corrupt it by 
> default. This issue proposes to restore the default to `NEVER_INFER`, as in 
> Spark 2.1.0. Users can take the risk of enabling `INFER_AND_SAVE` themselves.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

2017-10-22 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-22329:
-

 Summary: Use NEVER_INFER for 
`spark.sql.hive.caseSensitiveInferenceMode` by default
 Key: SPARK-22329
 URL: https://issues.apache.org/jira/browse/SPARK-22329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Dongjoon Hyun
Priority: Critical


In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical 
issue. 

- SPARK-19611 switched the default to `INFER_AND_SAVE` in 2.2.0 because Spark 2.1.0 
broke some Hive tables backed by case-sensitive data files.
bq. This situation will occur for any Hive table that wasn't created by Spark 
or that was created prior to Spark 2.1.0. If a user attempts to run a query 
over such a table containing a case-sensitive field name in the query 
projection or in the query filter, the query will return 0 results in every 
case.

- However, SPARK-22306 reports that this also corrupts the Hive Metastore schema by 
removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing the table owner.

- Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS look okay 
at least. However, we need to figure out the issue of changing owners. Also, we 
cannot backport the bucketing patch into `branch-2.2`. We need more testing before 
releasing 2.3.0.

Hive Metastore is a shared resource and Spark should not corrupt it by default. 
This issue proposes to restore the default to `NEVER_INFER`, as in Spark 2.1.0. 
Users can take the risk of enabling `INFER_AND_SAVE` themselves.
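
If the default does go back to `NEVER_INFER`, opting back in would be a one-line setting at session construction; a minimal sketch (assumes a Hive-enabled build, application name is arbitrary):
{code:scala}
import org.apache.spark.sql.SparkSession

object InferAndSaveOptIn extends App {
  // Explicitly accept the inference (and metastore write-back) behavior introduced in 2.2.0.
  val spark = SparkSession.builder()
    .appName("infer-and-save-opt-in")
    .enableHiveSupport()
    .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
    .getOrCreate()
}
{code}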



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-22 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214332#comment-16214332
 ] 

Kazuaki Ishizaki commented on SPARK-22284:
--

Yes, there was a similar ticket that was solved. Your case is a more complicated 
one that includes `struct`. I created a repro in my environment and think it 
should be solvable. I will create a PR this week.

To fix a problem, we usually submit a PR against the master branch. Then we 
consider whether it can be applied to previous versions, based on risk 
assessments by the committers.

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> In short, the code looks like this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes target versions up to Spark 2.1.0, so I don't know whether 
> running on Spark 2.2.0 would fix it. In any case, I cannot change the Spark 
> version since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
> but I still get the same error.
> The job worked well until now, even with large datasets, but apparently this 
> batch got larger and that is the only thing that changed. Is there any 
> workaround for this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2620) case class cannot be used as key for reduce

2017-10-22 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-2620:
--
Affects Version/s: 2.2.0

> case class cannot be used as key for reduce
> ---
>
> Key: SPARK-2620
> URL: https://issues.apache.org/jira/browse/SPARK-2620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 
> 2.2.0
> Environment: reproduced on spark-shell local[4]
>Reporter: Gerard Maas
>Assignee: Tobias Schlatter
>Priority: Critical
>  Labels: case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(abe),1), (P(charly),1))
> This contrasts with the expected behavior, which should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17902:


Assignee: (was: Apache Spark)

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214263#comment-16214263
 ] 

Apache Spark commented on SPARK-17902:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19551

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17902:


Assignee: Apache Spark

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22142) Move Flume support behind a profile

2017-10-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214238#comment-16214238
 ] 

Hyukjin Kwon commented on SPARK-22142:
--

Let me leave a link to the dev mailing list - 
http://apache-spark-developers-list.1001551.n3.nabble.com/Should-Flume-integration-be-behind-a-profile-td22535.html

> Move Flume support behind a profile
> ---
>
> Key: SPARK-22142
> URL: https://issues.apache.org/jira/browse/SPARK-22142
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> Kafka 0.8 support was recently put behind a profile. YARN, Mesos, Kinesis, and 
> Docker-related integrations are behind profiles. Flume support could be as 
> well, making it opt-in for builds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22272) killing task may cause the executor progress hang because of the JVM bug

2017-10-22 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214220#comment-16214220
 ] 

roncenzhao commented on SPARK-22272:


I think the simple way is to set 'spark.file.transferTo' to false. I did this in 
our production env and the problem has never been seen again.

> killing task may cause the executor progress hang because of the JVM bug
> 
>
> Key: SPARK-22272
> URL: https://issues.apache.org/jira/browse/SPARK-22272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2
> Environment: java version "1.7.0_75"
> hadoop version 2.5.0
>Reporter: roncenzhao
> Attachments: 26883.jstack, screenshot-1.png, screenshot-2.png
>
>
> JVM bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8132693
> We kill the task using 'Thread.interrupt()', and the ShuffleMapTask uses NIO to 
> merge all partition files when 'spark.file.transferTo' is true (the default), so 
> it may trigger the JVM bug.
> When the driver sends a task to this bad executor, the task will never run, and 
> as a result the job will hang forever without being handled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org