[jira] [Commented] (SPARK-16188) Spark sql create a lot of small files
[ https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131796#comment-16131796 ] cen yuhai commented on SPARK-16188: --- [~xianlongZhang] yes, you are right. I have implemented this feature by adding a repartition by rand() at the analysis phase. > Spark sql create a lot of small files > - > > Key: SPARK-16188 > URL: https://issues.apache.org/jira/browse/SPARK-16188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 > Environment: spark 1.6.1 >Reporter: cen yuhai > > I find that Spark SQL will create as many files as there are > partitions. When the results are small, there will be too many small files and most of them are > empty. > Hive has a mechanism to detect the average file size. If the average file size is > smaller than "hive.merge.smallfiles.avgsize", Hive will add a job to merge the > files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
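A minimal sketch of the repartition-by-rand() idea mentioned in the comment above, written against the DataFrame API rather than the analyzer change the commenter describes; the table name, query, and partition count are illustrative assumptions, not part of the ticket.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().appName("small-files-sketch").getOrCreate()

// Hypothetical query whose result is small relative to the number of shuffle partitions.
val result = spark.sql("SELECT * FROM some_table")

// Redistribute rows evenly over a small, explicit number of partitions before writing,
// so the output is a handful of reasonably sized files instead of one mostly empty
// file per shuffle partition. 8 is an arbitrary illustrative value.
result.repartition(8, rand())
  .write
  .mode("overwrite")
  .saveAsTable("result_table")
{code}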
[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2
[ https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131794#comment-16131794 ] Sergey Serebryakov commented on SPARK-21782: Your understanding is correct. Either reusing the same {{Random}} instance multiple times (not really an option as shuffle is parallel), using a better RNG, or substantially scrambling the seed (hashing?) will help. Changing the "smearing" algorithm would also work, e.g. to something like this: {code} val distributePartition = (index: Int, items: Iterator[T]) => { val rng = new Random(index) items.map { t => (rng.nextInt(numPartitions), t) } } : Iterator[(Int, T)] {code} Please let me know which way you'd like to see it. > Repartition creates skews when numPartitions is a power of 2 > > > Key: SPARK-21782 > URL: https://issues.apache.org/jira/browse/SPARK-21782 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sergey Serebryakov > Labels: repartition > Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png > > > *Problem:* > When an RDD (particularly with a low item-per-partition ratio) is > repartitioned to {{numPartitions}} = power of 2, the resulting partitions are > very uneven-sized. This affects both {{repartition()}} and > {{coalesce(shuffle=true)}}. > *Steps to reproduce:* > {code} > $ spark-shell > scala> sc.parallelize(0 until 1000, > 250).repartition(64).glom().map(_.length).collect() > res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) > {code} > *Explanation:* > Currently, the [algorithm for > repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450] > (shuffle-enabled coalesce) is as follows: > - for each initial partition {{index}}, generate {{position}} as {{(new > Random(index)).nextInt(numPartitions)}} > - then, for element number {{k}} in initial partition {{index}}, put it in > the new partition {{position + k}} (modulo {{numPartitions}}). > So, essentially elements are smeared roughly equally over {{numPartitions}} > buckets - starting from the one with number {{position+1}}. > Note that a new instance of {{Random}} is created for every initial partition > {{index}}, with a fixed seed {{index}}, and then discarded. So the > {{position}} is deterministic for every {{index}} for any RDD in the world. > Also, [{{nextInt(bound)}} > implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393] > has a special case when {{bound}} is a power of 2, which is basically taking > several highest bits from the initial seed, with only a minimal scrambling. > Due to deterministic seed, using the generator only once, and lack of > scrambling, the {{position}} values for power-of-two {{numPartitions}} always > end up being almost the same regardless of the {{index}}, causing some > buckets to be much more popular than others. So, {{repartition}} will in fact > intentionally produce skewed partitions even when before the partition were > roughly equal in size. > The behavior seems to have been introduced in SPARK-1770 by > https://github.com/apache/spark/pull/727/ > {quote} > The load balancing is not perfect: a given output partition > can have up to N more elements than the average if there are N input > partitions. 
However, some randomization is used to minimize the > probabiliy that this happens. > {quote} > Another related ticket: SPARK-17817 - > https://github.com/apache/spark/pull/15445 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
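For reference, a minimal sketch of the seed-scrambling alternative mentioned in the comment above: hash the partition index before seeding {{Random}}, so that {{nextInt(numPartitions)}} no longer sees nearly identical seeds when {{numPartitions}} is a power of 2. This is only one of the options discussed, not necessarily the change that was committed.
{code}
import java.util.Random
import scala.util.hashing

// Starting position for elements of input partition `index`, with the seed
// passed through an integer hash instead of being used raw.
def startingPosition(index: Int, numPartitions: Int): Int = {
  val scrambledSeed = hashing.byteswap32(index) // cheap avalanche of the low bits
  new Random(scrambledSeed).nextInt(numPartitions)
}

// With the raw seed, indices 0..249 collapse onto a handful of positions for
// numPartitions = 64; with the scrambled seed they spread out far more evenly:
// (0 until 250).map(startingPosition(_, 64)).distinct.size
{code}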
[jira] [Updated] (SPARK-21771) SparkSQLEnv creates a useless meta hive client
[ https://issues.apache.org/jira/browse/SPARK-21771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21771: -- Issue Type: Improvement (was: Bug) > SparkSQLEnv creates a useless meta hive client > -- > > Key: SPARK-21771 > URL: https://issues.apache.org/jira/browse/SPARK-21771 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kent Yao >Priority: Minor > > Once a meta hive client is created, it generates its SessionState, which > creates a lot of session-related directories, some marked deleteOnExit and some > not. If a hive client is useless, we need not create it at the very start. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2
[ https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131784#comment-16131784 ] Sean Owen commented on SPARK-21782: --- Is the problem summary just: with a power of 2 bound, similar seeds give similar output? Isn't that solved with a better RNG or just simple scrambling of the seed bits? The seed is actually essential. > Repartition creates skews when numPartitions is a power of 2 > > > Key: SPARK-21782 > URL: https://issues.apache.org/jira/browse/SPARK-21782 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sergey Serebryakov > Labels: repartition > Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png > > > *Problem:* > When an RDD (particularly with a low item-per-partition ratio) is > repartitioned to {{numPartitions}} = power of 2, the resulting partitions are > very uneven-sized. This affects both {{repartition()}} and > {{coalesce(shuffle=true)}}. > *Steps to reproduce:* > {code} > $ spark-shell > scala> sc.parallelize(0 until 1000, > 250).repartition(64).glom().map(_.length).collect() > res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) > {code} > *Explanation:* > Currently, the [algorithm for > repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450] > (shuffle-enabled coalesce) is as follows: > - for each initial partition {{index}}, generate {{position}} as {{(new > Random(index)).nextInt(numPartitions)}} > - then, for element number {{k}} in initial partition {{index}}, put it in > the new partition {{position + k}} (modulo {{numPartitions}}). > So, essentially elements are smeared roughly equally over {{numPartitions}} > buckets - starting from the one with number {{position+1}}. > Note that a new instance of {{Random}} is created for every initial partition > {{index}}, with a fixed seed {{index}}, and then discarded. So the > {{position}} is deterministic for every {{index}} for any RDD in the world. > Also, [{{nextInt(bound)}} > implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393] > has a special case when {{bound}} is a power of 2, which is basically taking > several highest bits from the initial seed, with only a minimal scrambling. > Due to deterministic seed, using the generator only once, and lack of > scrambling, the {{position}} values for power-of-two {{numPartitions}} always > end up being almost the same regardless of the {{index}}, causing some > buckets to be much more popular than others. So, {{repartition}} will in fact > intentionally produce skewed partitions even when before the partition were > roughly equal in size. > The behavior seems to have been introduced in SPARK-1770 by > https://github.com/apache/spark/pull/727/ > {quote} > The load balancing is not perfect: a given output partition > can have up to N more elements than the average if there are N input > partitions. However, some randomization is used to minimize the > probabiliy that this happens. 
> {quote} > Another related ticket: SPARK-17817 - > https://github.com/apache/spark/pull/15445 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131775#comment-16131775 ] Sean Owen commented on SPARK-21770: --- What's the current behavior for the prediction? I feel like we established this behavior on purpose. If all 0 is the output it should not be modified. The question that matters is what is predicted. > ProbabilisticClassificationModel: Improve normalization of all-zero raw > predictions > --- > > Key: SPARK-21770 > URL: https://issues.apache.org/jira/browse/SPARK-21770 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Minor > > Given an n-element raw prediction vector of all-zeros, > ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output > a probability vector of all-equal 1/n entries -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
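To make the requested behavior concrete, here is a small standalone sketch of the normalization rule the ticket describes: an all-zero raw prediction vector becomes a uniform 1/n probability vector, anything else is divided by its sum. The helper name is hypothetical; the real normalizeToProbabilitiesInPlace is a private method of ProbabilisticClassificationModel and may differ.
{code}
import org.apache.spark.ml.linalg.{DenseVector, Vectors}

// Hypothetical standalone version of the normalization described in the ticket.
def normalizeToProbabilities(raw: DenseVector): DenseVector = {
  val sum = raw.values.sum
  val n = raw.size
  val probs =
    if (sum == 0.0) Array.fill(n)(1.0 / n) // all-zero raw predictions -> uniform 1/n
    else raw.values.map(_ / sum)           // otherwise normalize by the sum
  new DenseVector(probs)
}

// normalizeToProbabilities(Vectors.dense(0.0, 0.0, 0.0).toDense)
//   == Vectors.dense(1.0 / 3, 1.0 / 3, 1.0 / 3)
{code}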
[jira] [Commented] (SPARK-16188) Spark sql create a lot of small files
[ https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131772#comment-16131772 ] xianlongZhang commented on SPARK-16188: --- cen yuhai, thanks for your advice, but my company's data platform processes tens of thousands of SQL queries every day, so it is not practical to modify each query; I think adding a common mechanism to deal with this problem is the most suitable solution. > Spark sql create a lot of small files > - > > Key: SPARK-16188 > URL: https://issues.apache.org/jira/browse/SPARK-16188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 > Environment: spark 1.6.1 >Reporter: cen yuhai > > I find that Spark SQL will create as many files as there are > partitions. When the results are small, there will be too many small files and most of them are > empty. > Hive has a mechanism to detect the average file size. If the average file size is > smaller than "hive.merge.smallfiles.avgsize", Hive will add a job to merge the > files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21776. --- Resolution: Invalid > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 >Priority: Trivial > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21776: -- Priority: Trivial (was: Blocker) > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 >Priority: Trivial > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2
[ https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Serebryakov updated SPARK-21782: --- Attachment: Screen Shot 2017-08-16 at 3.40.01 PM.png Distribution of partition sizes (in bytes) spotted in the wild. Horizontal axis: partition index ({{0..1023}}). Vertical axis: partition size in bytes. > Repartition creates skews when numPartitions is a power of 2 > > > Key: SPARK-21782 > URL: https://issues.apache.org/jira/browse/SPARK-21782 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sergey Serebryakov > Labels: repartition > Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png > > > *Problem:* > When an RDD (particularly with a low item-per-partition ratio) is > repartitioned to {{numPartitions}} = power of 2, the resulting partitions are > very uneven-sized. This affects both {{repartition()}} and > {{coalesce(shuffle=true)}}. > *Steps to reproduce:* > {code} > $ spark-shell > scala> sc.parallelize(0 until 1000, > 250).repartition(64).glom().map(_.length).collect() > res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) > {code} > *Explanation:* > Currently, the [algorithm for > repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450] > (shuffle-enabled coalesce) is as follows: > - for each initial partition {{index}}, generate {{position}} as {{(new > Random(index)).nextInt(numPartitions)}} > - then, for element number {{k}} in initial partition {{index}}, put it in > the new partition {{position + k}} (modulo {{numPartitions}}). > So, essentially elements are smeared roughly equally over {{numPartitions}} > buckets - starting from the one with number {{position+1}}. > Note that a new instance of {{Random}} is created for every initial partition > {{index}}, with a fixed seed {{index}}, and then discarded. So the > {{position}} is deterministic for every {{index}} for any RDD in the world. > Also, [{{nextInt(bound)}} > implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393] > has a special case when {{bound}} is a power of 2, which is basically taking > several highest bits from the initial seed, with only a minimal scrambling. > Due to deterministic seed, using the generator only once, and lack of > scrambling, the {{position}} values for power-of-two {{numPartitions}} always > end up being almost the same regardless of the {{index}}, causing some > buckets to be much more popular than others. So, {{repartition}} will in fact > intentionally produce skewed partitions even when before the partition were > roughly equal in size. > The behavior seems to have been introduced in SPARK-1770 by > https://github.com/apache/spark/pull/727/ > {quote} > The load balancing is not perfect: a given output partition > can have up to N more elements than the average if there are N input > partitions. However, some randomization is used to minimize the > probabiliy that this happens. > {quote} > Another related ticket: SPARK-17817 - > https://github.com/apache/spark/pull/15445 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2
Sergey Serebryakov created SPARK-21782: -- Summary: Repartition creates skews when numPartitions is a power of 2 Key: SPARK-21782 URL: https://issues.apache.org/jira/browse/SPARK-21782 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Sergey Serebryakov *Problem:* When an RDD (particularly with a low item-per-partition ratio) is repartitioned to {{numPartitions}} = power of 2, the resulting partitions are very uneven-sized. This affects both {{repartition()}} and {{coalesce(shuffle=true)}}. *Steps to reproduce:* {code} $ spark-shell scala> sc.parallelize(0 until 1000, 250).repartition(64).glom().map(_.length).collect() res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) {code} *Explanation:* Currently, the [algorithm for repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450] (shuffle-enabled coalesce) is as follows: - for each initial partition {{index}}, generate {{position}} as {{(new Random(index)).nextInt(numPartitions)}} - then, for element number {{k}} in initial partition {{index}}, put it in the new partition {{position + k}} (modulo {{numPartitions}}). So, essentially elements are smeared roughly equally over {{numPartitions}} buckets - starting from the one with number {{position+1}}. Note that a new instance of {{Random}} is created for every initial partition {{index}}, with a fixed seed {{index}}, and then discarded. So the {{position}} is deterministic for every {{index}} for any RDD in the world. Also, [{{nextInt(bound)}} implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393] has a special case when {{bound}} is a power of 2, which is basically taking several highest bits from the initial seed, with only a minimal scrambling. Due to deterministic seed, using the generator only once, and lack of scrambling, the {{position}} values for power-of-two {{numPartitions}} always end up being almost the same regardless of the {{index}}, causing some buckets to be much more popular than others. So, {{repartition}} will in fact intentionally produce skewed partitions even when before the partition were roughly equal in size. The behavior seems to have been introduced in SPARK-1770 by https://github.com/apache/spark/pull/727/ {quote} The load balancing is not perfect: a given output partition can have up to N more elements than the average if there are N input partitions. However, some randomization is used to minimize the probabiliy that this happens. {quote} Another related ticket: SPARK-17817 - https://github.com/apache/spark/pull/15445 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21739) timestamp partition would fail in v2.2.0
[ https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21739. - Resolution: Fixed Assignee: Feng Zhu Fix Version/s: 2.3.0 2.2.1 > timestamp partition would fail in v2.2.0 > > > Key: SPARK-21739 > URL: https://issues.apache.org/jira/browse/SPARK-21739 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: wangzhihao >Assignee: Feng Zhu >Priority: Critical > Fix For: 2.2.1, 2.3.0 > > > The spark v2.2.0 introduce TimeZoneAwareExpression, which causes bugs if we > select data from a table with timestamp partitions. > The steps to reproduce it: > {code:java} > spark.sql("create table test (foo string) parititioned by (ts timestamp)") > spark.sql("insert into table test partition(ts = 1) values('hi')") > spark.table("test").show() > {code} > The root cause is that TableReader.scala#230 try to cast the string to > timestamp regardless if the timeZone exists. > Here is the error stack trace > {code} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46) > at > org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172) > >at > org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172) > at > org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131749#comment-16131749 ] zhaP524 edited comment on SPARK-21776 at 8/18/17 5:36 AM: -- [~kiszk] I see , I have changed the type of question,This problem has caused my business to be suspended was (Author: 扎啤): @Kazuaki Ishizaki I see , I have changed the type of question,This problem has caused my business to be suspended > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 >Priority: Blocker > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131749#comment-16131749 ] zhaP524 commented on SPARK-21776: - @Kazuaki Ishizaki I see , I have changed the type of question,This problem has caused my business to be suspended > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 >Priority: Blocker > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaP524 updated SPARK-21776: Priority: Blocker (was: Major) Issue Type: Improvement (was: Bug) > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 >Priority: Blocker > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21781) Modify DataSourceScanExec to use concrete ColumnVector type.
Takuya Ueshin created SPARK-21781: - Summary: Modify DataSourceScanExec to use concrete ColumnVector type. Key: SPARK-21781 URL: https://issues.apache.org/jira/browse/SPARK-21781 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takuya Ueshin As mentioned at https://github.com/apache/spark/pull/18680#issuecomment-316820409, when we have more {{ColumnVector}} implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches. As for read path, one of the major paths is the one generated by {{ColumnBatchScan}}. Currently it refers {{ColumnVector}} so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. vectorized Parquet reader uses {{OnHeapColumnVector}}. We can use the concrete type in the generated code directly to avoid the penalty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21778) Simpler Dataset.sample API in Scala / Java
[ https://issues.apache.org/jira/browse/SPARK-21778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-21778: Summary: Simpler Dataset.sample API in Scala / Java (was: Simpler Dataset.sample API in Scala) > Simpler Dataset.sample API in Scala / Java > -- > > Key: SPARK-21778 > URL: https://issues.apache.org/jira/browse/SPARK-21778 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21779) Simpler Dataset.sample API in Python
Reynold Xin created SPARK-21779: --- Summary: Simpler Dataset.sample API in Python Key: SPARK-21779 URL: https://issues.apache.org/jira/browse/SPARK-21779 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21780) Simpler Dataset.sample API in R
Reynold Xin created SPARK-21780: --- Summary: Simpler Dataset.sample API in R Key: SPARK-21780 URL: https://issues.apache.org/jira/browse/SPARK-21780 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21778) Simpler Dataset.sample API in Scala
Reynold Xin created SPARK-21778: --- Summary: Simpler Dataset.sample API in Scala Key: SPARK-21778 URL: https://issues.apache.org/jira/browse/SPARK-21778 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21777) Simpler Dataset.sample API
Reynold Xin created SPARK-21777: --- Summary: Simpler Dataset.sample API Key: SPARK-21777 URL: https://issues.apache.org/jira/browse/SPARK-21777 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
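A sketch of the API shape the ticket describes, runnable in spark-shell where {{spark}} is predefined; the exact new overloads are whatever the eventual patch defines, so the simpler calls are shown commented out as assumptions.
{code}
val df = spark.range(1000).toDF("id")

// Existing API: withReplacement always has to be spelled out, even though
// sampling without replacement is the common case.
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// Proposed simpler calls taking just the fraction (and optionally a seed):
// val s1 = df.sample(0.1)
// val s2 = df.sample(0.1, 42L)
{code}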
[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - External issue URL: (was: https://github.com/apache/spark/pull/18986) > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - External issue URL: https://github.com/apache/spark/pull/18986 > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - External issue ID: (was: SPARK-21646) > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - Description: Data {code} create temporary view tb as select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b) {code} SQL: {code} select a, b from tb where a=0 {code} The result which is wrong: {code} ++---+ | a| b| ++---+ | 0| 1| |-0.1| 2| ++---+ {code} Logical Plan: {code} == Parsed Logical Plan == 'Project ['a] +- 'Filter ('a = 0) +- 'UnresolvedRelation `src` == Analyzed Logical Plan == a: string Project [a#8528] +- Filter (cast(a#8528 as int) = 0) +- SubqueryAlias src +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] +- LocalRelation [_1#8525, _2#8526] {code} was: Data {code} create temporary view tb as select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b) {code} SQL: {code} select a, b from tb where a=0 {code} The result which is wrong: {code} ++---+ | a| b| ++---+ | 0| 1| |-0.1| 2| ++---+ {code} > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131681#comment-16131681 ] Kazuaki Ishizaki commented on SPARK-21776: -- Is this a question? If this is a question, it would be better to send a message to u...@spark.apache.org OR d...@spark.apache.org. > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Bug > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaP524 updated SPARK-21776: Component/s: Spark Core Issue Type: Bug (was: Question) > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Bug > Components: Block Manager, Documentation, Input/Output, Spark Core >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaP524 updated SPARK-21776: Description: In generation, we have to use the Spark full quantity loaded HBase table based on one dimension table to generate business, because the base table is total quantity loaded, the memory will pressure is very big, I want to see if the Spark can use this way to deal with memory mapped file?Is there such a mechanism?How do you use it? And I found in the Spark a parameter: spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is used for? There is a putBytes and getBytes method in DiskStore.scala with Spark source code, is this the memory-mapped file mentioned above?How to understand? Let me know if you have any trouble.. Wish to You!! was: In generation, we have to use the Spark full quantity loaded HBase table based on one dimension table to generate business, because the base table is total quantity loaded, the memory will pressure is very big, I want to see if the Spark can use this way to deal with memory mapped file?Is there such a mechanism?How do you use it? And I found in the Spark a parameter: spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is used for? There is a putBytes and getBytes method in DiskStore.scala with Spark source code, is this the memory-mapped file mentioned above?How to understand?? Let me know if you have any trouble.. Wish to You!! Summary: How to use the memory-mapped file on Spark?? (was: How to use the memory-mapped file on Spark???) > How to use the memory-mapped file on Spark?? > > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Question > Components: Block Manager, Documentation, Input/Output >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to understand? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark???
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131672#comment-16131672 ] zhaP524 commented on SPARK-21776: - !screenshot-2.png! > How to use the memory-mapped file on Spark??? > - > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Question > Components: Block Manager, Documentation, Input/Output >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to > understand?? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark???
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131671#comment-16131671 ] zhaP524 commented on SPARK-21776: - !screenshot-1.png! > How to use the memory-mapped file on Spark??? > - > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Question > Components: Block Manager, Documentation, Input/Output >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to > understand?? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark???
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaP524 updated SPARK-21776: Attachment: screenshot-1.png > How to use the memory-mapped file on Spark??? > - > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Question > Components: Block Manager, Documentation, Input/Output >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to > understand?? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark???
[ https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaP524 updated SPARK-21776: Attachment: screenshot-2.png > How to use the memory-mapped file on Spark??? > - > > Key: SPARK-21776 > URL: https://issues.apache.org/jira/browse/SPARK-21776 > Project: Spark > Issue Type: Question > Components: Block Manager, Documentation, Input/Output >Affects Versions: 2.1.1 > Environment: Spark 2.1.1 > Scala 2.11.8 >Reporter: zhaP524 > Attachments: screenshot-1.png, screenshot-2.png > > > In generation, we have to use the Spark full quantity loaded HBase > table based on one dimension table to generate business, because the base > table is total quantity loaded, the memory will pressure is very big, I want > to see if the Spark can use this way to deal with memory mapped file?Is there > such a mechanism?How do you use it? > And I found in the Spark a parameter: > spark.storage.memoryMapThreshold=2m, is not very clear what this parameter is > used for? >There is a putBytes and getBytes method in DiskStore.scala with Spark > source code, is this the memory-mapped file mentioned above?How to > understand?? >Let me know if you have any trouble.. > Wish to You!! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21776) How to use the memory-mapped file on Spark???
zhaP524 created SPARK-21776: --- Summary: How to use the memory-mapped file on Spark??? Key: SPARK-21776 URL: https://issues.apache.org/jira/browse/SPARK-21776 Project: Spark Issue Type: Question Components: Block Manager, Documentation, Input/Output Affects Versions: 2.1.1 Environment: Spark 2.1.1 Scala 2.11.8 Reporter: zhaP524 In production, we have to use Spark to do a full load of an HBase table and join it with a dimension table to generate business data. Because the base table is fully loaded, the memory pressure is very high, so I want to know whether Spark can handle this with a memory-mapped file. Is there such a mechanism, and how is it used? I also found a Spark parameter, spark.storage.memoryMapThreshold=2m, and it is not very clear what this parameter is used for. There are putBytes and getBytes methods in DiskStore.scala in the Spark source code; is this the memory-mapped file mentioned above? How should I understand it? Let me know if anything is unclear. Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
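For what it is worth, spark.storage.memoryMapThreshold is not a general switch for running Spark on memory-mapped files; as documented, it is the minimum block size above which DiskStore memory-maps a block when reading it back from disk. A minimal sketch of setting it (the 2m value simply mirrors the default mentioned above):
{code}
import org.apache.spark.sql.SparkSession

// Blocks larger than this threshold are memory-mapped when read back from disk;
// smaller blocks are read into a regular buffer instead.
val spark = SparkSession.builder()
  .appName("memory-map-threshold-sketch")
  .config("spark.storage.memoryMapThreshold", "2m")
  .getOrCreate()
{code}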
[jira] [Commented] (SPARK-5073) "spark.storage.memoryMapThreshold" has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131667#comment-16131667 ] zhaP524 commented on SPARK-5073: I wonder what this parameter is for. I also want to know whether this parameter is related to memory-mapped files. I would appreciate any clarification. > "spark.storage.memoryMapThreshold" has two default values > - > > Key: SPARK-5073 > URL: https://issues.apache.org/jira/browse/SPARK-5073 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Jianhui Yuan >Priority: Minor > > In org.apache.spark.storage.DiskStore: > val minMemoryMapBytes = > blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L) > In org.apache.spark.network.util.TransportConf: > public int memoryMapBytes() { > return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * > 1024); > } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21775) Dynamic Log Level Settings for executors
[ https://issues.apache.org/jira/browse/SPARK-21775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LvDongrong updated SPARK-21775: --- Attachment: web.PNG terminal.PNG I changed the log level of the driver to debug and it took effect, and the same works for the executors. > Dynamic Log Level Settings for executors > > > Key: SPARK-21775 > URL: https://issues.apache.org/jira/browse/SPARK-21775 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: LvDongrong > Attachments: terminal.PNG, web.PNG > > > Someimes we want to change the log level of executor when our application has > already deployed, to see detail infomation or decrease the log items. > Changing the log4j configure file is not convenient,so We add the ability to > set log level settings for a running executor. It can also reverts the log > level back to what it was before you added the setting. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21775) Dynamic Log Level Settings for executors
LvDongrong created SPARK-21775: -- Summary: Dynamic Log Level Settings for executors Key: SPARK-21775 URL: https://issues.apache.org/jira/browse/SPARK-21775 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Affects Versions: 2.2.0 Reporter: LvDongrong Sometimes we want to change the log level of an executor after our application has already been deployed, to see detailed information or to reduce the number of log items. Changing the log4j configuration file is not convenient, so we add the ability to set the log level for a running executor. It can also revert the log level back to what it was before the setting was added. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
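As a rough sketch, the executor-side core of such a feature would boil down to a call into the log4j 1.x API that Spark 2.x ships with, along the lines below; the Web UI plumbing and the revert-to-previous-level bookkeeping from the proposal are not shown, and the helper name is an assumption.
{code}
import org.apache.log4j.{Level, LogManager}

// Hypothetical executor-side helper: change the root logger level at runtime and
// return the previous level so it can be restored later.
def setExecutorLogLevel(level: String): Level = {
  val previous = LogManager.getRootLogger.getLevel
  LogManager.getRootLogger.setLevel(Level.toLevel(level)) // e.g. "DEBUG", "WARN"
  previous
}

// Spark already exposes the driver-side variant as sc.setLogLevel("DEBUG");
// this ticket is about pushing the same change out to running executors.
{code}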
[jira] [Created] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
StanZhai created SPARK-21774: Summary: The rule PromoteStrings casts string to a wrong data type Key: SPARK-21774 URL: https://issues.apache.org/jira/browse/SPARK-21774 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: StanZhai Priority: Critical Data: {code} create temporary view tb as select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b) {code} SQL: {code} select a, b from tb where a=0 {code} The result, which is wrong (only the row ("0", 1) should match): {code}
+----+---+
|   a|  b|
+----+---+
|   0|  1|
|-0.1|  2|
+----+---+
{code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
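For readers reproducing this, the snippet below (spark-shell style; the spark session is the one provided by the shell) restates the report and adds an explicit cast as a possible workaround; only the row ("0", 1) should come back.
{code}
// Reproduction of the report, spark-shell style.
spark.sql("""
  create temporary view tb as
  select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b)
""")

// Form from the report: a string column compared against an int literal.
spark.sql("select a, b from tb where a = 0").show()

// Possible workaround: cast explicitly so both sides are compared as double.
spark.sql("select a, b from tb where cast(a as double) = 0").show()
{code}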
[jira] [Created] (SPARK-21773) Should Install mkdocs if missing in the path in SQL documentation build
Hyukjin Kwon created SPARK-21773: Summary: Should Install mkdocs if missing in the path in SQL documentation build Key: SPARK-21773 URL: https://issues.apache.org/jira/browse/SPARK-21773 Project: Spark Issue Type: Improvement Components: Build, Documentation Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Priority: Minor Currently, the SQL documentation build with Jenkins appears to be failing due to missing {{mkdocs}}. For example, see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console {code} ... Moving back into docs dir. Moving to SQL directory and building docs. Missing mkdocs in your path, skipping SQL documentation generation. Moving back into docs dir. Making directory api/sql cp -r ../sql/site/. api/sql jekyll 2.5.3 | Error: unknown file type: ../sql/site/. Deleting credential directory /home/jenkins/workspace/spark-master-docs/spark-utils/new-release-scripts/jenkins/jenkins-credentials-scUXuITy Build step 'Execute shell' marked build as failure [WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done Finished: FAILURE {code} It looks like we had better install it when it is missing rather than fail the build. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21772) HiveException unable to move results from srcf to destf in InsertIntoHiveTable
liupengcheng created SPARK-21772: Summary: HiveException unable to move results from srcf to destf in InsertIntoHiveTable Key: SPARK-21772 URL: https://issues.apache.org/jira/browse/SPARK-21772 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.1 Environment: JDK1.7 CentOS 6.3 Spark2.1 Reporter: liupengcheng Currently, when execute {code:java} create table as select {code} would return Exception: {code:java} 2017-08-17,16:14:18,792 ERROR org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.hive.client.Shim_v0_12.loadTable(HiveShim.scala:346) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:770) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:770) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:770) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:316) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:262) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:261) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305) at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:769) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:765) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:763) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:763) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100) at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:763) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:119) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:92) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:119) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at org.apache.spark.sql.Dataset.(Dataset.scala:187) at org.apache.spark.sql.Dataset$.ofR
[jira] [Created] (SPARK-21771) SparkSQLEnv creates a useless meta hive client
Kent Yao created SPARK-21771: Summary: SparkSQLEnv creates a useless meta hive client Key: SPARK-21771 URL: https://issues.apache.org/jira/browse/SPARK-21771 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kent Yao Priority: Minor Once a meta Hive client is created, it generates its SessionState, which creates a lot of session-related directories; some are marked deleteOnExit, some are not. If a Hive client is not needed, we should not create it at the very start. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
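A minimal sketch of the general direction implied here, assuming nothing about the actual patch: defer construction of an expensive, possibly unused client with a lazy val so its session directories are only created when the client is first used.
{code}
// Illustration only: lazy initialization of an expensive, possibly unused client.
class ExpensiveClient {
  // Side effects (creating session state, scratch directories, ...) happen here.
  println("creating session state and scratch directories")
}

class SqlEnvSketch {
  // Nothing is constructed until `client` is first dereferenced,
  // so an environment that never needs it never pays the cost.
  lazy val client: ExpensiveClient = new ExpensiveClient
}
{code}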
[jira] [Updated] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21770: --- Description: Given an n-element raw prediction vector of all-zeros, ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a probability vector of all-equal 1/n entries (was: Given a raw prediction vector of all-zeros, ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a probability vector that predicts each class with equal probability (1 / numClasses).) > ProbabilisticClassificationModel: Improve normalization of all-zero raw > predictions > --- > > Key: SPARK-21770 > URL: https://issues.apache.org/jira/browse/SPARK-21770 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Minor > > Given an n-element raw prediction vector of all-zeros, > ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output > a probability vector of all-equal 1/n entries -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
Siddharth Murching created SPARK-21770: -- Summary: ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions Key: SPARK-21770 URL: https://issues.apache.org/jira/browse/SPARK-21770 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Siddharth Murching Priority: Minor Given a raw prediction vector of all-zeros, ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a probability vector that predicts each class with equal probability (1 / numClasses). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
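A small sketch of the behaviour being asked for, as an illustration rather than the actual ML implementation: an all-zero raw prediction vector normalizes to a uniform 1/n probability vector, and any other non-negative vector is scaled to sum to one.
{code}
object NormalizeSketch {
  def normalizeToProbabilities(raw: Array[Double]): Array[Double] = {
    val sum = raw.sum
    if (sum == 0.0) Array.fill(raw.length)(1.0 / raw.length) // all-zero case: uniform 1/n
    else raw.map(_ / sum)                                    // otherwise scale to sum to 1
  }

  def main(args: Array[String]): Unit = {
    // A 4-class all-zero raw prediction becomes [0.25, 0.25, 0.25, 0.25].
    println(normalizeToProbabilities(Array(0.0, 0.0, 0.0, 0.0)).mkString(", "))
  }
}
{code}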
[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537 ] George Pongracz edited comment on SPARK-21702 at 8/18/17 12:55 AM: --- *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". Data bearing files if written without using "PartitionBy", when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". All related non-data bearing files, irrespective whether "PartitionBy" has been or not been used when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, when "PartitionBy" has been used, brings up a dedicated overview screen for the file, reports it as having AES-256 encryption, which differs from how its reported with encryption "-" in the parent screen and selected using its LHS check-box. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I spun my wheels for a time believing I had not encrypted and trying to debug until I stumbled upon what I just described in this update. Thanks for the advice about the s3a.impl field :) was (Author: gpongracz): *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". Data bearing files if written without using "PartitionBy", when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". All related non-data bearing files, irrespective whether "PartitionBy" has been or not been used when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, when "PartitionBy" has been used, brings up a dedicated overview screen for the file, reports it as having AES-256 encryption, which differs from how its reported with encryption "-" in the parent screen and selected using its LHS check-box. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I spun my wheels for a time believing I had not encrypted and trying to debug until I stumbled upon what I just described in this update. 
Thanks for the advice about the s3a.impl field > Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when > PartitionBy Used > > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537 ] George Pongracz edited comment on SPARK-21702 at 8/18/17 12:55 AM: --- *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". Data bearing files if written without using "PartitionBy", when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". All related non-data bearing files, irrespective whether "PartitionBy" has been or not been used when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, when "PartitionBy" has been used, brings up a dedicated overview screen for the file, reports it as having AES-256 encryption, which differs from how its reported with encryption "-" in the parent screen and selected using its LHS check-box. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I spun my wheels for a time believing I had not encrypted and trying to debug until I stumbled upon what I just described in this update. Thanks for the advice about the s3a.impl field was (Author: gpongracz): *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". Data bearing files if written without using "PartitionBy", when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". All related non-data bearing files, irrespective whether "PartitionBy" has been or not been used when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, when "PartitionBy" has been used, brings up a dedicated overview screen for the file, reports it as having AES-256 encryption, which differs from how its reported with encryption "-" in the parent screen and selected using its LHS check-box. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I spun my wheels for a time believing I had not encrypted and trying to debug until I stumbled upon what I just described in this update. 
> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when > PartitionBy Used > > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537 ] George Pongracz edited comment on SPARK-21702 at 8/18/17 12:54 AM: --- *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". Data bearing files if written without using "PartitionBy", when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". All related non-data bearing files, irrespective whether "PartitionBy" has been or not been used when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, when "PartitionBy" has been used, brings up a dedicated overview screen for the file, reports it as having AES-256 encryption, which differs from how its reported with encryption "-" in the parent screen and selected using its LHS check-box. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I spun my wheels for a time believing I had not encrypted and trying to debug until I stumbled upon what I just described in this update. was (Author: gpongracz): *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". All related non-data bearing files when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, which brings up a dedicated overview screen for the file, reports it as having AES-256 encryption. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I lost a bit of time believing I had not encrypted and tried to debug until I stumbled upon what I just described in this update. Obviously only happening when PartitionBy is used. 
> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when > PartitionBy Used > > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537 ] George Pongracz edited comment on SPARK-21702 at 8/18/17 12:47 AM: --- *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". All related non-data bearing files when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, which brings up a dedicated overview screen for the file, reports it as having AES-256 encryption. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. I must say I lost a bit of time believing I had not encrypted and tried to debug until I stumbled upon what I just described in this update. Obviously only happening when PartitionBy is used. was (Author: gpongracz): *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". All related non-data bearing files when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, which brings up a dedicated overview screen for the file, reports it as having AES-256 encryption. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. Obviously only happening when PartitionBy is used. > Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when > PartitionBy Used > > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] George Pongracz updated SPARK-21702: Summary: Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used (was: Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used) > Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when > PartitionBy Used > > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537 ] George Pongracz edited comment on SPARK-21702 at 8/18/17 12:43 AM: --- *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section report their encryption as "-". All related non-data bearing files when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, which brings up a dedicated overview screen for the file, reports it as having AES-256 encryption. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection seems unencrypted, whilst really the files on deeper via click-through report as encrypted. I think this lowers the weight of this issue and I can close if deemed a non issue, however it would be good if the files would written would all present consistently and correctly, whether data or non-data bearing. Obviously only happening when PartitionBy is used. was (Author: gpongracz): *Update:* The data bearing files (files that contain the data payload from the stream) written to s3 when viewed through the AWS S3 GUI and selected using their LHS check-box encryption in the properties section as "-". All related non-data bearing files when selected using their LHS check-box encryption in the properties section report their encryption as "AES-256". When clicking through the name of a single data bearing file, which brings up a dedicated overview screen for the file, reports it as having AES-256 encryption. As one can see, this labelling of encryption is inconsistent and can cause confusion that a file on first inspection is unencrypted. The good news is that the files are all encrypted underneath even if not appearing so at initial inspection though the AWS S3 GUI. I think this lowers the priority of this iss and I can close if deemed a non issue - please advise. > Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used > - > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537 ] George Pongracz commented on SPARK-21702: - *Update:* The data bearing files (files that contain the data payload from the stream) written to S3, when viewed through the AWS S3 GUI and selected using their LHS check-box, report their encryption in the properties section as "-". All related non-data bearing files, when selected using their LHS check-box, report their encryption in the properties section as "AES-256". Clicking through the name of a single data bearing file, which brings up a dedicated overview screen for the file, reports it as having AES-256 encryption. As one can see, this labelling of encryption is inconsistent and can cause the confusion that a file on first inspection appears unencrypted. The good news is that the files are all encrypted underneath, even if they do not appear so on initial inspection through the AWS S3 GUI. I think this lowers the priority of this issue and I can close it if deemed a non-issue - please advise. > Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used > - > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
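For reference, a minimal sketch (spark-shell or application-init style) of applying the settings quoted in this issue when building a session; the app name is hypothetical, and this only mirrors the configuration already shown above.
{code}
import org.apache.spark.sql.SparkSession

// Sketch: request SSE-S3 (AES256) on S3A uploads, mirroring the settings in the issue.
val spark = SparkSession.builder()
  .appName("s3a-sse-sketch")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
  .getOrCreate()

// Any writer targeting an s3a:// path should then have its objects stored
// with AES-256 server-side encryption requested on upload.
{code}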
[jira] [Commented] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark
[ https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131512#comment-16131512 ] Bryan Cutler commented on SPARK-21685: -- I believe the problem is that during the call to transform, the PySpark model does not differentiate between set and default params and then sets them all in Java. I have submitted a fix at https://github.com/apache/spark/pull/18982; could you try that and see if it works for you? > Params isSet in scala Transformer triggered by _setDefault in pyspark > - > > Key: SPARK-21685 > URL: https://issues.apache.org/jira/browse/SPARK-21685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Ratan Rai Sur > > I'm trying to write a PySpark wrapper for a Transformer whose transform > method includes the line > {code:java} > require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both > outputNodeName and outputNodeIndex") > {code} > This should only throw an exception when both of these parameters are > explicitly set. > In the PySpark wrapper for the Transformer, there is this line in ___init___ > {code:java} > self._setDefault(outputNodeIndex=0) > {code} > Here is the line in the main python script showing how it is being configured > {code:java} > cntkModel = > CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark, > model.uri).setOutputNodeName("z") > {code} > As you can see, only setOutputNodeName is being explicitly set but the > exception is still being thrown. > If you need more context, > https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the > branch with the code, the files I'm referring to here that are tracked are > the following: > src/cntk-model/src/main/scala/CNTKModel.scala > notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb > The pyspark wrapper code is autogenerated -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
[ https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131493#comment-16131493 ] Liang-Chi Hsieh commented on SPARK-21759: - Submitted PR at https://github.com/apache/spark/pull/18968 > In.checkInputDataTypes should not wrongly report unresolved plans for IN > correlated subquery > > > Key: SPARK-21759 > URL: https://issues.apache.org/jira/browse/SPARK-21759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > With the check for structural integrity proposed in SPARK-21726, I found that > an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved > plans. > For a correlated IN query like: > {code} > Project [a#0] > +- Filter a#0 IN (list#4 [b#1]) >: +- Project [c#2] >: +- Filter (outer(b#1) < d#3) >:+- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > After {{PullupCorrelatedPredicates}}, it produces query plan like: > {code} > 'Project [a#0] > +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > Because the correlated predicate involves another attribute {{d#3}} in > subquery, it has been pulled out and added into the {{Project}} on the top of > the subquery. > When {{list}} in {{In}} contains just one {{ListQuery}}, > {{In.checkInputDataTypes}} checks if the size of {{value}} expressions > matches the output size of subquery. In the above example, there is only > {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, > so it fails the check and {{In.resolved}} returns {{false}}. > We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans > to fail the structural integrity check. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
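A concrete query with the shape shown in those plans, as a hypothetical spark-shell reproduction; the table and column names a, b, c, d follow the attribute names in the plan.
{code}
// Hypothetical reproduction of the plan shape discussed above (spark-shell style).
import spark.implicits._

Seq((1, 2)).toDF("a", "b").createOrReplaceTempView("t1")
Seq((3, 4)).toDF("c", "d").createOrReplaceTempView("t2")

// Correlated IN subquery: the correlated predicate t1.b < t2.d references the extra
// attribute d, which PullupCorrelatedPredicates pulls into the subquery's Project,
// making the subquery output wider than the single IN value.
spark.sql("""
  select a from t1
  where a in (select c from t2 where t1.b < t2.d)
""").show()
{code}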
[jira] [Updated] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
[ https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-21759: Description: With the check for structural integrity proposed in SPARK-21726, I found that an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved plans. For a correlated IN query like: {code} Project [a#0] +- Filter a#0 IN (list#4 [b#1]) : +- Project [c#2] : +- Filter (outer(b#1) < d#3) :+- LocalRelation , [c#2, d#3] +- LocalRelation , [a#0, b#1] {code} After {{PullupCorrelatedPredicates}}, it produces query plan like: {code} 'Project [a#0] +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation , [c#2, d#3] +- LocalRelation , [a#0, b#1] {code} Because the correlated predicate involves another attribute {{d#3}} in subquery, it has been pulled out and added into the {{Project}} on the top of the subquery. When {{list}} in {{In}} contains just one {{ListQuery}}, {{In.checkInputDataTypes}} checks if the size of {{value}} expressions matches the output size of subquery. In the above example, there is only {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, so it fails the check and {{In.resolved}} returns {{false}}. We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans to fail the structural integrity check. was: With the check for structural integrity proposed in SPARK-21726, I found that an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved plans. For a correlated IN query like: {code} Project [a#0] +- Filter a#0 IN (list#4 [b#1]) : +- Project [c#2] : +- Filter (outer(b#1) < d#3) :+- LocalRelation , [c#2, d#3] +- LocalRelation , [a#0, b#1] {code} After {{PullupCorrelatedPredicates}}, it produces query plan like: {code} 'Project [a#0] +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation , [c#2, d#3] +- LocalRelation , [a#0, b#1] {code} Because the correlated predicate involves another attribute {{d#3}} in subquery, it has been pulled out and added into the {{Project}} on the top of the subquery. When {{list}} in {{In}} contains just one {{ListQuery}}, {{In.checkInputDataTypes}} checks if the size of {{value}} expressions matches the output size of subquery. In the above example, there is only {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, so it fails the check and {{In.resolved}} returns {{false}}. We should let {{PullupCorrelatedPredicates}} produce resolved plans to pass the structural integrity check. > In.checkInputDataTypes should not wrongly report unresolved plans for IN > correlated subquery > > > Key: SPARK-21759 > URL: https://issues.apache.org/jira/browse/SPARK-21759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > With the check for structural integrity proposed in SPARK-21726, I found that > an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved > plans. 
> For a correlated IN query like: > {code} > Project [a#0] > +- Filter a#0 IN (list#4 [b#1]) >: +- Project [c#2] >: +- Filter (outer(b#1) < d#3) >:+- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > After {{PullupCorrelatedPredicates}}, it produces query plan like: > {code} > 'Project [a#0] > +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > Because the correlated predicate involves another attribute {{d#3}} in > subquery, it has been pulled out and added into the {{Project}} on the top of > the subquery. > When {{list}} in {{In}} contains just one {{ListQuery}}, > {{In.checkInputDataTypes}} checks if the size of {{value}} expressions > matches the output size of subquery. In the above example, there is only > {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, > so it fails the check and {{In.resolved}} returns {{false}}. > We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans > to fail the structural integrity check. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
[ https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-21759: Summary: In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery (was: PullupCorrelatedPredicates should not produce unresolved plan) > In.checkInputDataTypes should not wrongly report unresolved plans for IN > correlated subquery > > > Key: SPARK-21759 > URL: https://issues.apache.org/jira/browse/SPARK-21759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > With the check for structural integrity proposed in SPARK-21726, I found that > an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved > plans. > For a correlated IN query like: > {code} > Project [a#0] > +- Filter a#0 IN (list#4 [b#1]) >: +- Project [c#2] >: +- Filter (outer(b#1) < d#3) >:+- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > After {{PullupCorrelatedPredicates}}, it produces query plan like: > {code} > 'Project [a#0] > +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > Because the correlated predicate involves another attribute {{d#3}} in > subquery, it has been pulled out and added into the {{Project}} on the top of > the subquery. > When {{list}} in {{In}} contains just one {{ListQuery}}, > {{In.checkInputDataTypes}} checks if the size of {{value}} expressions > matches the output size of subquery. In the above example, there is only > {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, > so it fails the check and {{In.resolved}} returns {{false}}. > We should let {{PullupCorrelatedPredicates}} produce resolved plans to pass > the structural integrity check. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21677. - Resolution: Fixed Fix Version/s: 2.3.0 > json_tuple throws NullPointException when column is null as string type. > > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Jen-Ming Chung >Priority: Minor > Labels: Starter > Fix For: 2.3.0 > > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > ++ > | c0| > ++ > |null| > ++ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds we should show explicit error messages or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-21677: --- Assignee: Jen-Ming Chung > json_tuple throws NullPointException when column is null as string type. > > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Jen-Ming Chung >Priority: Minor > Labels: Starter > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > ++ > | c0| > ++ > |null| > ++ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds we should show explicit error messages or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21767) Add Decimal Test For Avro in VersionSuite
[ https://issues.apache.org/jira/browse/SPARK-21767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21767. - Resolution: Fixed Fix Version/s: 2.3.0 > Add Decimal Test For Avro in VersionSuite > - > > Key: SPARK-21767 > URL: https://issues.apache.org/jira/browse/SPARK-21767 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Decimal is a logical type in Avro. We need to ensure that Hive's Avro serde > support works well in Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21769) Add a table property for Hive-serde tables to control Spark always respecting schemas inferred by Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-21769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21769: Summary: Add a table property for Hive-serde tables to control Spark always respecting schemas inferred by Spark SQL (was: Add a table property for Hive-serde tables to controlling Spark always respecting schemas inferred by Spark SQL) > Add a table property for Hive-serde tables to control Spark always respecting > schemas inferred by Spark SQL > --- > > Key: SPARK-21769 > URL: https://issues.apache.org/jira/browse/SPARK-21769 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > For Hive-serde tables, we always respect the schema stored in Hive metastore, > because the schema could be altered by the other engines that share the same > metastore. Thus, we always trust the metastore-controlled schema for > Hive-serde tables when the schemas are different (without considering the > nullability and cases). However, in some scenarios, Hive metastore also could > INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in > serde are different. > The proposed solution is to introduce a table property for such scenarios. > For a specific Hive-serde table, users can manually setting such table > property for asking Spark for always respect Spark-inferred schema instead of > trusting metastore-controlled schema. By default, it is off. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21769) Add a table property for Hive-serde tables to controlling Spark always respecting schemas inferred by Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-21769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21769: Issue Type: Improvement (was: Bug) > Add a table property for Hive-serde tables to controlling Spark always > respecting schemas inferred by Spark SQL > --- > > Key: SPARK-21769 > URL: https://issues.apache.org/jira/browse/SPARK-21769 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > For Hive-serde tables, we always respect the schema stored in Hive metastore, > because the schema could be altered by the other engines that share the same > metastore. Thus, we always trust the metastore-controlled schema for > Hive-serde tables when the schemas are different (without considering the > nullability and cases). However, in some scenarios, Hive metastore also could > INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in > serde are different. > The proposed solution is to introduce a table property for such scenarios. > For a specific Hive-serde table, users can manually setting such table > property for asking Spark for always respect Spark-inferred schema instead of > trusting metastore-controlled schema. By default, it is off. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21769) Add a table property for Hive-serde tables to controlling Spark always respecting schemas inferred by Spark SQL
Xiao Li created SPARK-21769: --- Summary: Add a table property for Hive-serde tables to controlling Spark always respecting schemas inferred by Spark SQL Key: SPARK-21769 URL: https://issues.apache.org/jira/browse/SPARK-21769 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li For Hive-serde tables, we always respect the schema stored in Hive metastore, because the schema could be altered by the other engines that share the same metastore. Thus, we always trust the metastore-controlled schema for Hive-serde tables when the schemas are different (without considering the nullability and cases). However, in some scenarios, Hive metastore also could INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in serde are different. The proposed solution is to introduce a table property for such scenarios. For a specific Hive-serde table, users can manually set such a table property to ask Spark to always respect the Spark-inferred schema instead of trusting the metastore-controlled schema. By default, it is off. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
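As a usage illustration only: once such a table property exists, setting it would presumably look like the statement below (spark-shell style). The property name used here is an invented placeholder, not the name introduced by this ticket, and the table name is hypothetical.
{code}
// Hypothetical: 'spark.sql.schema.useSparkInferred' is a placeholder property name,
// and 'my_hive_table' is a hypothetical Hive-serde table.
spark.sql("""
  ALTER TABLE my_hive_table
  SET TBLPROPERTIES ('spark.sql.schema.useSparkInferred' = 'true')
""")
{code}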
[jira] [Updated] (SPARK-21769) Add a table property for Hive-serde tables to make Spark always respect schemas inferred by Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-21769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21769: Summary: Add a table property for Hive-serde tables to make Spark always respect schemas inferred by Spark SQL (was: Add a table property for Hive-serde tables to control Spark always respecting schemas inferred by Spark SQL) > Add a table property for Hive-serde tables to make Spark always respect > schemas inferred by Spark SQL > - > > Key: SPARK-21769 > URL: https://issues.apache.org/jira/browse/SPARK-21769 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > For Hive-serde tables, we always respect the schema stored in Hive metastore, > because the schema could be altered by the other engines that share the same > metastore. Thus, we always trust the metastore-controlled schema for > Hive-serde tables when the schemas are different (without considering the > nullability and cases). However, in some scenarios, Hive metastore also could > INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in > serde are different. > The proposed solution is to introduce a table property for such scenarios. > For a specific Hive-serde table, users can manually set such a table > property to ask Spark to always respect the Spark-inferred schema instead of > trusting the metastore-controlled schema. By default, it is off. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
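To illustrate how the proposed per-table switch might be used once it exists, here is a minimal sketch. The property name {{spark.sql.respectSparkSchema}} and the table {{mydb.events}} are placeholders invented for illustration; the actual property name is whatever the patch for this ticket settles on.
{code}
// Minimal sketch, assuming a hypothetical table property name; the real name is
// defined by the patch for this ticket, not here.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-serde-schema-property")
  .enableHiveSupport()
  .getOrCreate()

// Opt a single Hive-serde table into using the Spark-inferred schema.
spark.sql(
  "ALTER TABLE mydb.events SET TBLPROPERTIES ('spark.sql.respectSparkSchema' = 'true')")

// Later reads of that table would then prefer the Spark-inferred schema over the
// schema recorded in the Hive metastore.
spark.table("mydb.events").printSchema()
{code}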
[jira] [Resolved] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16742. Resolution: Fixed Assignee: Arthur Rand Fix Version/s: 2.3.0 [~arand] I don't see a specific bug tracking long-running app support (i.e. principal/keytab arguments to spark-submit), which I don't think this patch covered, so you may want to look at filing something to track that. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt >Assignee: Arthur Rand > Fix For: 2.3.0 > > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131340#comment-16131340 ] Joseph K. Bradley commented on SPARK-19747: --- Just saying: Thanks a lot for doing this reorg! It's a nice step towards having pluggable algorithms. > Consolidate code in ML aggregators > -- > > Key: SPARK-19747 > URL: https://issues.apache.org/jira/browse/SPARK-19747 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > Many algorithms in Spark ML are posed as optimization of a differentiable > loss function over a parameter vector. We implement these by having a loss > function accumulate the gradient using an Aggregator class which has methods > that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm > that obeys this form implements a cost function class and an aggregator > class, which are completely separate from one another but share probably 80% > of the same code. > I think it is important to clean things like this up, and if we can do it > properly it will make the code much more maintainable, readable, and bug > free. It will also help reduce the overhead of future implementations. > The design is of course open for discussion, but I think we should aim to: > 1. Have all aggregators share parent classes, so that they only need to > implement the {{add}} function. This is really the only difference in the > current aggregators. > 2. Have a single, generic cost function that is parameterized by the > aggregator type. This reduces the many places we implement cost functions and > greatly reduces the amount of duplicated code. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
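As a rough sketch of the proposal (a shared parent class so each aggregator only implements {{add}}, with {{merge}} acting as the {{combOp}}), something along these lines could work. This is illustrative only, not Spark's actual internal API, and the {{(label, weight, features)}} tuple stands in for whatever instance type the real code uses.
{code}
// Illustrative sketch of a shared aggregator parent class; not Spark's actual API.
import org.apache.spark.ml.linalg.Vector

abstract class DifferentiableLossAggregator[Agg <: DifferentiableLossAggregator[Agg]] {
  self: Agg =>

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected val gradientSumArray: Array[Double]

  /** The only algorithm-specific piece: accumulate one (label, weight, features) instance (the seqOp). */
  def add(instance: (Double, Double, Vector)): Agg

  /** Shared combOp: fold another partition's partial sums into this aggregator. */
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    var i = 0
    while (i < gradientSumArray.length) {
      gradientSumArray(i) += other.gradientSumArray(i)
      i += 1
    }
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}
{code}
A single generic cost function parameterized by the aggregator type could then drive any such aggregator through treeAggregate, which is essentially point 2 of the description.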
[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131315#comment-16131315 ] Xiao Li commented on SPARK-4131: https://github.com/apache/spark/pull/18975 > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: Fei Wang >Priority: Critical > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries is not supported by Spark SQL. > e.g.: > {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-4131: --- Target Version/s: 2.3.0 > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: Fei Wang >Priority: Critical > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries is not supported by Spark SQL. > e.g.: > {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses
[ https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18394. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.3.0 > Executing the same query twice in a row results in CodeGenerator cache misses > - > > Key: SPARK-18394 > URL: https://issues.apache.org/jira/browse/SPARK-18394 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop >Reporter: Jonny Serencsa >Assignee: Takeshi Yamamuro > Fix For: 2.3.0 > > > Executing the query: > {noformat} > select > l_returnflag, > l_linestatus, > sum(l_quantity) as sum_qty, > sum(l_extendedprice) as sum_base_price, > sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, > sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, > avg(l_quantity) as avg_qty, > avg(l_extendedprice) as avg_price, > avg(l_discount) as avg_disc, > count(*) as count_order > from > lineitem_1_row > where > l_shipdate <= date_sub('1998-12-01', '90') > group by > l_returnflag, > l_linestatus > ; > {noformat} > twice (in succession), will result in CodeGenerator cache misses in BOTH > executions. Since the query is identical, I would expect the same code to be > generated. > Turns out, the generated code is not exactly the same, resulting in cache > misses when performing the lookup in the CodeGenerator cache. Yet, the code > is equivalent. > Below is (some portion of the) generated code for two runs of the query: > run-1 > {noformat} > import java.nio.ByteBuffer; > import java.nio.ByteOrder; > import scala.collection.Iterator; > import org.apache.spark.sql.types.DataType; > import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder; > import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter; > import org.apache.spark.sql.execution.columnar.MutableUnsafeRow; > public SpecificColumnarIterator generate(Object[] references) { > return new SpecificColumnarIterator(); > } > class SpecificColumnarIterator extends > org.apache.spark.sql.execution.columnar.ColumnarIterator { > private ByteOrder nativeOrder = null; > private byte[][] buffers = null; > private UnsafeRow unsafeRow = new UnsafeRow(7); > private BufferHolder bufferHolder = new BufferHolder(unsafeRow); > private UnsafeRowWriter rowWriter = new UnsafeRowWriter(bufferHolder, 7); > private MutableUnsafeRow mutableRow = null; > private int currentRow = 0; > private int numRowsInBatch = 0; > private scala.collection.Iterator input = null; > private DataType[] columnTypes = null; > private int[] columnIndexes = null; > private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor; > private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor > accessor1; > private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor > accessor2; > private org.apache.spark.sql.execution.columnar.StringColumnAccessor > accessor3; > private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor > accessor4; > private org.apache.spark.sql.execution.columnar.StringColumnAccessor > accessor5; > private org.apache.spark.sql.execution.columnar.StringColumnAccessor > accessor6; > public SpecificColumnarIterator() { > this.nativeOrder = ByteOrder.nativeOrder(); > this.buffers = new byte[7][]; > this.mutableRow = new MutableUnsafeRow(rowWriter); > } > public void initialize(Iterator input, DataType[] columnTypes, int[] > columnIndexes) { > this.input = input; > this.columnTypes = columnTypes; > 
this.columnIndexes = columnIndexes; > } > public boolean hasNext() { > if (currentRow < numRowsInBatch) { > return true; > } > if (!input.hasNext()) { > return false; > } > org.apache.spark.sql.execution.columnar.CachedBatch batch = > (org.apache.spark.sql.execution.columnar.CachedBatch) input.next(); > currentRow = 0; > numRowsInBatch = batch.numRows(); > for (int i = 0; i < columnIndexes.length; i ++) { > buffers[i] = batch.buffers()[columnIndexes[i]]; > } > accessor = new > org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[0]).order(nativeOrder)); > accessor1 = new > org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[1]).order(nativeOrder)); > accessor2 = new > org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[2]).order(nativeOrder)); > accessor3 = new > org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[3]).order(nativeOrder)); > accessor4 = new > org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[4]).order(nativeOrder)); > accessor5 = ne
[jira] [Created] (SPARK-21768) spark.csv.read Empty String Parsed as NULL when nullValue is Set
Andrew Gross created SPARK-21768: Summary: spark.csv.read Empty String Parsed as NULL when nullValue is Set Key: SPARK-21768 URL: https://issues.apache.org/jira/browse/SPARK-21768 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.2.0, 2.0.2 Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2) PySpark Reporter: Andrew Gross In a CSV with quoted fields, empty strings will be interpreted as NULL even when a nullValue is explicitly set: Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX {{"XXNULLXX"|""|"XXNULLXX"|"foo"}} PySpark Script to load the file (from S3):
{code:title=load.py|borderStyle=solid}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("test_csv").getOrCreate()

fields = []
fields.append(StructField("First Null Field", StringType(), True))
fields.append(StructField("Empty String Field", StringType(), True))
fields.append(StructField("Second Null Field", StringType(), True))
fields.append(StructField("Non Empty String Field", StringType(), True))
schema = StructType(fields)

keys = ['s3://mybucket/test/demo.csv']
bad_data = spark.read.csv(keys, timestampFormat="-MM-dd HH:mm:ss", mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
bad_data.show()
{code}
Output
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|              null|             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}
Expected Output:
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|                  |             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
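Since the CSV options involved ({{sep}}, {{nullValue}}, {{mode}}) are shared across language APIs, the same behaviour should be reproducible from Scala as well; a minimal sketch, assuming a local copy of the one-line sample file at a placeholder path:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("csv-nullvalue-repro").master("local[*]").getOrCreate()

// Same four nullable string columns as the PySpark script above.
val schema = StructType(Seq(
  StructField("First Null Field", StringType, nullable = true),
  StructField("Empty String Field", StringType, nullable = true),
  StructField("Second Null Field", StringType, nullable = true),
  StructField("Non Empty String Field", StringType, nullable = true)))

val badData = spark.read
  .schema(schema)
  .option("sep", "|")
  .option("mode", "FAILFAST")
  .option("nullValue", "XXNULLXX")
  .csv("/tmp/demo.csv")  // placeholder path for the sample row shown above

// The quoted empty string in the second column also comes back as null,
// which is the behaviour this issue reports.
badData.show()
{code}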
[jira] [Commented] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
[ https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131190#comment-16131190 ] Steve Loughran commented on SPARK-21762: SPARK-20703 simplifies this, especially testing, as it's isolated from FileFormatWriter. Same problem exists though: if you are getting any Create inconsistency, metrics probes trigger failures which may not be present by the time task commit actually takes place > FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new > file isn't yet visible > > > Key: SPARK-21762 > URL: https://issues.apache.org/jira/browse/SPARK-21762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: object stores without complete creation consistency > (this includes AWS S3's caching of negative GET results) >Reporter: Steve Loughran >Priority: Minor > > The metrics collection of SPARK-20703 can trigger premature failure if the > newly written object isn't actually visible yet, that is if, after > {{writer.close()}}, a {{getFileStatus(path)}} returns a > {{FileNotFoundException}}. > Strictly speaking, not having a file immediately visible goes against the > fundamental expectations of the Hadoop FS APIs, namely full consistent data & > medata across all operations, with immediate global visibility of all > changes. However, not all object stores make that guarantee, be it only newly > created data or updated blobs. And so spurious FNFEs can get raised, ones > which *should* have gone away by the time the actual task is committed. Or if > they haven't, the job is in such deep trouble. > What to do? > # leave as is: fail fast & so catch blobstores/blobstore clients which don't > behave as required. One issue here: will that trigger retries, what happens > there, etc, etc. > # Swallow the FNFE and hope the file is observable later. > # Swallow all IOEs and hope that whatever problem the FS has is transient. > Options 2 & 3 aren't going to collect metrics in the event of a FNFE, or at > least, not the counter of bytes written. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
[ https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131190#comment-16131190 ] Steve Loughran edited comment on SPARK-21762 at 8/17/17 7:41 PM: - SPARK-21669 simplifies this, especially testing, as it's isolated from FileFormatWriter. Same problem exists though: if you are getting any Create inconsistency, metrics probes trigger failures which may not be present by the time task commit actually takes place was (Author: ste...@apache.org): SPARK-20703 simplifies this, especially testing, as it's isolated from FileFormatWriter. Same problem exists though: if you are getting any Create inconsistency, metrics probes trigger failures which may not be present by the time task commit actually takes place > FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new > file isn't yet visible > > > Key: SPARK-21762 > URL: https://issues.apache.org/jira/browse/SPARK-21762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: object stores without complete creation consistency > (this includes AWS S3's caching of negative GET results) >Reporter: Steve Loughran >Priority: Minor > > The metrics collection of SPARK-20703 can trigger premature failure if the > newly written object isn't actually visible yet, that is if, after > {{writer.close()}}, a {{getFileStatus(path)}} returns a > {{FileNotFoundException}}. > Strictly speaking, not having a file immediately visible goes against the > fundamental expectations of the Hadoop FS APIs, namely full consistent data & > medata across all operations, with immediate global visibility of all > changes. However, not all object stores make that guarantee, be it only newly > created data or updated blobs. And so spurious FNFEs can get raised, ones > which *should* have gone away by the time the actual task is committed. Or if > they haven't, the job is in such deep trouble. > What to do? > # leave as is: fail fast & so catch blobstores/blobstore clients which don't > behave as required. One issue here: will that trigger retries, what happens > there, etc, etc. > # Swallow the FNFE and hope the file is observable later. > # Swallow all IOEs and hope that whatever problem the FS has is transient. > Options 2 & 3 aren't going to collect metrics in the event of a FNFE, or at > least, not the counter of bytes written. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
[ https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-21762: --- Summary: FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible (was: FileFormatWriter metrics collection fails if a newly close()d file isn't yet visible) > FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new > file isn't yet visible > > > Key: SPARK-21762 > URL: https://issues.apache.org/jira/browse/SPARK-21762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: object stores without complete creation consistency > (this includes AWS S3's caching of negative GET results) >Reporter: Steve Loughran >Priority: Minor > > The metrics collection of SPARK-20703 can trigger premature failure if the > newly written object isn't actually visible yet, that is if, after > {{writer.close()}}, a {{getFileStatus(path)}} returns a > {{FileNotFoundException}}. > Strictly speaking, not having a file immediately visible goes against the > fundamental expectations of the Hadoop FS APIs, namely full consistent data & > medata across all operations, with immediate global visibility of all > changes. However, not all object stores make that guarantee, be it only newly > created data or updated blobs. And so spurious FNFEs can get raised, ones > which *should* have gone away by the time the actual task is committed. Or if > they haven't, the job is in such deep trouble. > What to do? > # leave as is: fail fast & so catch blobstores/blobstore clients which don't > behave as required. One issue here: will that trigger retries, what happens > there, etc, etc. > # Swallow the FNFE and hope the file is observable later. > # Swallow all IOEs and hope that whatever problem the FS has is transient. > Options 2 & 3 aren't going to collect metrics in the event of a FNFE, or at > least, not the counter of bytes written. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
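Option 2 from the description would amount to wrapping the file-status probe so that a not-yet-visible object is treated as "size unknown" rather than a task failure; a minimal sketch of that idea (not the actual BasicWriteTaskStatsTracker code):
{code}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of option 2: swallow the FNFE and skip the bytes-written metric for
// files the object store has not made visible yet.
def tryGetFileSize(fs: FileSystem, path: Path): Option[Long] =
  try {
    Some(fs.getFileStatus(path).getLen)
  } catch {
    case _: FileNotFoundException => None
  }
{code}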
[jira] [Commented] (SPARK-21767) Add Decimal Test For Avro in VersionSuite
[ https://issues.apache.org/jira/browse/SPARK-21767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131028#comment-16131028 ] Xiao Li commented on SPARK-21767: - https://github.com/apache/spark/pull/18977 > Add Decimal Test For Avro in VersionSuite > - > > Key: SPARK-21767 > URL: https://issues.apache.org/jira/browse/SPARK-21767 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Decimal is a logical type of AVRO. We need to ensure the support of Hive's > AVRO serde works well in Spark -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21767) Add Decimal Test For Avro in VersionSuite
[ https://issues.apache.org/jira/browse/SPARK-21767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21767: Description: Decimal is a logical type of AVRO. We need to ensure the support of Hive's AVRO serde works well in Spark > Add Decimal Test For Avro in VersionSuite > - > > Key: SPARK-21767 > URL: https://issues.apache.org/jira/browse/SPARK-21767 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Decimal is a logical type of AVRO. We need to ensure the support of Hive's > AVRO serde works well in Spark -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21767) Add Decimal Test For Avro in VersionSuite
Xiao Li created SPARK-21767: --- Summary: Add Decimal Test For Avro in VersionSuite Key: SPARK-21767 URL: https://issues.apache.org/jira/browse/SPARK-21767 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21766) DataFrame toPandas() raises ValueError with nullable int columns
[ https://issues.apache.org/jira/browse/SPARK-21766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21766: - Summary: DataFrame toPandas() raises ValueError with nullable int columns (was: DataFrame toPandas() ) > DataFrame toPandas() raises ValueError with nullable int columns > > > Key: SPARK-21766 > URL: https://issues.apache.org/jira/browse/SPARK-21766 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > When calling {{DataFrame.toPandas()}} (without Arrow enabled), if there is an > IntegerType column that has null values, the following exception is thrown: > {noformat} > ValueError: Cannot convert non-finite values (NA or inf) to integer > {noformat} > This is because the null values first get converted to float NaN during the > construction of the Pandas DataFrame in {{from_records}}, and then it is > attempted to be converted back to an integer, where it fails. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21766) DataFrame toPandas()
Bryan Cutler created SPARK-21766: Summary: DataFrame toPandas() Key: SPARK-21766 URL: https://issues.apache.org/jira/browse/SPARK-21766 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Bryan Cutler When calling {{DataFrame.toPandas()}} (without Arrow enabled), if there is an IntegerType column that has null values, the following exception is thrown: {noformat} ValueError: Cannot convert non-finite values (NA or inf) to integer {noformat} This is because the null values first get converted to float NaN during the construction of the Pandas DataFrame in {{from_records}}, and then it is attempted to be converted back to an integer, where it fails. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130863#comment-16130863 ] Wenchen Fan edited comment on SPARK-15689 at 8/17/17 5:25 PM: -- google doc attached! was (Author: cloud_fan): good doc attached! > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130863#comment-16130863 ] Wenchen Fan commented on SPARK-15689: - good doc attached! > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
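Purely as an illustration of goals 2 and 3 (column batches plus filter push-down behind a small surface), a reader contract could be shaped roughly like the sketch below. This is not the interface proposed in the attached SPIP, just a stand-in to make the goals concrete.
{code}
// Illustrative only; the real proposal lives in the attached SPIP document.
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Stand-in for a columnar batch abstraction (goal 2).
trait ColumnBatch {
  def numRows(): Int
}

trait ColumnarBatchReader {
  /** Schema of the rows this source will produce. */
  def readSchema(): StructType

  /** Offer filters to the source; return the ones it could not handle (goal 3). */
  def pushFilters(filters: Array[Filter]): Array[Filter]

  /** Data is delivered as column batches rather than one row at a time. */
  def readBatches(): Iterator[ColumnBatch]
}
{code}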
[jira] [Commented] (SPARK-17414) Set type is not supported for creating data frames
[ https://issues.apache.org/jira/browse/SPARK-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130852#comment-16130852 ] Alexander Bessonov commented on SPARK-17414: Fixed in SPARK-21204 > Set type is not supported for creating data frames > -- > > Key: SPARK-17414 > URL: https://issues.apache.org/jira/browse/SPARK-17414 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Emre Colak >Priority: Minor > > For a case class that has a field of type Set, createDataFrame() method > throws an exception saying "Schema for type Set is not supported". Exception > is raised by the org.apache.spark.sql.catalyst.ScalaReflection class where > Array, Seq and Map types are supported but Set is not. It would be nice to > support Set here by default instead of having to write a custom Encoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
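For anyone hitting this on versions before the fix noted above, here is a minimal sketch of the failure and the usual workaround (converting the {{Set}} to a {{Seq}} before creating the DataFrame); the case class names are made up for illustration:
{code}
import org.apache.spark.sql.SparkSession

case class Record(id: Int, tags: Set[String])
case class RecordAsSeq(id: Int, tags: Seq[String])

val spark = SparkSession.builder().appName("set-encoder").master("local[*]").getOrCreate()
import spark.implicits._

// On affected versions this line throws "Schema for type Set is not supported":
// val broken = Seq(Record(1, Set("a", "b"))).toDF()

// Workaround: map the Set field to a Seq, which ScalaReflection does support.
val df = Seq(Record(1, Set("a", "b")), Record(2, Set("c")))
  .map(r => RecordAsSeq(r.id, r.tags.toSeq))
  .toDF()
df.printSchema()
{code}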
[jira] [Created] (SPARK-21765) Mark all streaming plans as isStreaming
Jose Torres created SPARK-21765: --- Summary: Mark all streaming plans as isStreaming Key: SPARK-21765 URL: https://issues.apache.org/jira/browse/SPARK-21765 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 2.2.0 Reporter: Jose Torres LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some streaming sources don't set the bit, and the bit can sometimes be lost in rewriting. Setting the bit for all plans that are logically streaming will help us simplify the logic around checking query plan validity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
Hyukjin Kwon created SPARK-21764: Summary: Tests failures on Windows: resources not being closed and incorrect paths Key: SPARK-21764 URL: https://issues.apache.org/jira/browse/SPARK-21764 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Priority: Minor This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 but decided to open another one here, targeting 2.3.0 as fixed version. In short, there are many test failures on Windows, mainly due to resources not being closed but attempted to be removed (this is failed on Windows) and incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21763) InferSchema option does not infer the correct schema (timestamp) from xlsx file.
ANSHUMAN created SPARK-21763: Summary: InferSchema option does not infer the correct schema (timestamp) from xlsx file. Key: SPARK-21763 URL: https://issues.apache.org/jira/browse/SPARK-21763 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Environment: Environment is my personal laptop. Reporter: ANSHUMAN Priority: Minor I have an xlsx file containing a date/time field (My Time) in the following format, with sample records: 5/16/2017 12:19:00 AM 5/16/2017 12:56:00 AM 5/16/2017 1:17:00 PM 5/16/2017 5:26:00 PM 5/16/2017 6:26:00 PM I am reading the xlsx file in the following manner:
{code:java}
val inputDF = spark.sqlContext.read.format("com.crealytics.spark.excel")
  .option("location","file:///C:/Users/file.xlsx")
  .option("useHeader","true")
  .option("treatEmptyValuesAsNulls","true")
  .option("inferSchema","true")
  .option("addColorColumns","false")
  .load()
{code}
When I try to get the schema using {code:java}inputDF.printSchema(){code}, I get *Double*. Sometimes I even get the schema as *String*. And when I print the data, I get the output as:
{noformat}
+------------------+
|           My Time|
+------------------+
|42871.014189814814|
| 42871.03973379629|
|42871.553773148145|
| 42871.72765046296|
| 42871.76887731482|
+------------------+
{noformat}
The above output is clearly not correct for the given input. Moreover, if I convert the xlsx file to csv format and read it, I get the output correctly. Here is how I read it in csv format:
{code:java}
spark.sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", true)
  .load(fileLocation)
{code}
Please look into the issue. I could not find the answer to it anywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
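The values shown (42871.xx) look like raw Excel serial date numbers (days since Excel's 1899-12-30 epoch), so until the excel data source infers the type correctly, one possible workaround is to convert the column manually. A rough sketch, assuming {{inputDF}} from the snippet above and ignoring time zones and Excel's 1900 leap-year quirk:
{code}
import org.apache.spark.sql.functions._

// 25569 is the Excel serial number for 1970-01-01, so (serial - 25569) * 86400
// gives seconds since the Unix epoch; casting that to timestamp does the rest.
val withTimestamp = inputDF.withColumn(
  "My Time",
  ((col("My Time") - lit(25569)) * lit(86400)).cast("timestamp"))

withTimestamp.show(false)
{code}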
[jira] [Assigned] (SPARK-21428) CliSessionState never be recognized because of IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21428: --- Assignee: Kent Yao > CliSessionState never be recognized because of IsolatedClientLoader > --- > > Key: SPARK-21428 > URL: https://issues.apache.org/jira/browse/SPARK-21428 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.1, 2.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 2.3.0 > > > When using bin/spark-sql with the builtin hive jars, we are expecting to > reuse the instance ofCliSessionState. > {quote} > // In `SparkSQLCLIDriver`, we have already started a > `CliSessionState`, > // which contains information like configurations from command line. > Later > // we call `SparkSQLEnv.init()` there, which would run into this part > again. > // so we should keep `conf` and reuse the existing instance of > `CliSessionState`. > {quote} > Actually it never ever happened since SessionState.get() at > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L138 > will always be null by IsolatedClientLoader. > The SessionState.start was called many times, which will creates > `hive.exec.strachdir`, see the following case... > {code:java} > spark git:(master) bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/07/16 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/07/16 23:29:04 INFO HiveMetaStore: 0: Opening raw store with implemenation > class:org.apache.hadoop.hive.metastore.ObjectStore > 17/07/16 23:29:04 INFO ObjectStore: ObjectStore, initialize called > 17/07/16 23:29:04 INFO Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 17/07/16 23:29:04 INFO Persistence: Property datanucleus.cache.level2 unknown > - will be ignored > 17/07/16 23:29:05 INFO ObjectStore: Setting MetaStore object pin classes with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 17/07/16 23:29:06 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 17/07/16 23:29:06 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 17/07/16 23:29:07 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 17/07/16 23:29:07 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 17/07/16 23:29:07 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is > DERBY > 17/07/16 23:29:07 INFO ObjectStore: Initialized ObjectStore > 17/07/16 23:29:07 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 17/07/16 23:29:07 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > 17/07/16 23:29:08 INFO HiveMetaStore: Added admin role in metastore > 17/07/16 23:29:08 INFO HiveMetaStore: Added public role in metastore > 17/07/16 23:29:08 INFO HiveMetaStore: No user is added in admin role, since > config is empty > 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_all_databases > 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr > cmd=get_all_databases > 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_functions: db=default pat=* > 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr > cmd=get_functions: db=default pat=* > 17/07/16 23:29:08 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as > "embedded-only" so does not have its own datastore table. > 17/07/16 23:29:08 INFO SessionState: Created local directory: > /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/a2c40e42-08e2-4023-8464-3432ed690184_resources > 17/07/16 23:29:08 INFO SessionState: Created HDFS directory: > /tmp/hive/Kent/a2c40e42-08e2-4023-8464-3432ed690184 > 17/07/16 23:29:08 INFO SessionState: Created local directory: > /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/Kent/a2c40e42-08e2-4023-8464-3432ed690184 > 17/07/16 23:29:08 INFO SessionState: Created HDFS directory: > /tmp/hive/Kent/a2c40e42-08e2-4023-8464-3432ed690184/_tmp_space.db > 1
[jira] [Resolved] (SPARK-21428) CliSessionState never be recognized because of IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21428. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18648 [https://github.com/apache/spark/pull/18648] > CliSessionState never be recognized because of IsolatedClientLoader > --- > > Key: SPARK-21428 > URL: https://issues.apache.org/jira/browse/SPARK-21428 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.1, 2.2.0 >Reporter: Kent Yao >Priority: Minor > Fix For: 2.3.0 > > > When using bin/spark-sql with the builtin hive jars, we are expecting to > reuse the instance ofCliSessionState. > {quote} > // In `SparkSQLCLIDriver`, we have already started a > `CliSessionState`, > // which contains information like configurations from command line. > Later > // we call `SparkSQLEnv.init()` there, which would run into this part > again. > // so we should keep `conf` and reuse the existing instance of > `CliSessionState`. > {quote} > Actually it never ever happened since SessionState.get() at > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L138 > will always be null by IsolatedClientLoader. > The SessionState.start was called many times, which will creates > `hive.exec.strachdir`, see the following case... > {code:java} > spark git:(master) bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/07/16 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/07/16 23:29:04 INFO HiveMetaStore: 0: Opening raw store with implemenation > class:org.apache.hadoop.hive.metastore.ObjectStore > 17/07/16 23:29:04 INFO ObjectStore: ObjectStore, initialize called > 17/07/16 23:29:04 INFO Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 17/07/16 23:29:04 INFO Persistence: Property datanucleus.cache.level2 unknown > - will be ignored > 17/07/16 23:29:05 INFO ObjectStore: Setting MetaStore object pin classes with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 17/07/16 23:29:06 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 17/07/16 23:29:06 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 17/07/16 23:29:07 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 17/07/16 23:29:07 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 17/07/16 23:29:07 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is > DERBY > 17/07/16 23:29:07 INFO ObjectStore: Initialized ObjectStore > 17/07/16 23:29:07 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 17/07/16 23:29:07 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > 17/07/16 23:29:08 INFO HiveMetaStore: Added admin role in metastore > 17/07/16 23:29:08 INFO HiveMetaStore: Added public role in metastore > 17/07/16 23:29:08 INFO HiveMetaStore: No user is added in admin role, since > config is empty > 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_all_databases > 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr > cmd=get_all_databases > 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_functions: db=default pat=* > 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr > cmd=get_functions: db=default pat=* > 17/07/16 23:29:08 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as > "embedded-only" so does not have its own datastore table. > 17/07/16 23:29:08 INFO SessionState: Created local directory: > /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/a2c40e42-08e2-4023-8464-3432ed690184_resources > 17/07/16 23:29:08 INFO SessionState: Created HDFS directory: > /tmp/hive/Kent/a2c40e42-08e2-4023-8464-3432ed690184 > 17/07/16 23:29:08 INFO SessionState: Created local directory: > /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/Kent/a2c40e42-08e2-4023-8464-3432ed690184 > 17/07/16 23:29:08 INFO SessionState: Created HDFS directory:
[jira] [Created] (SPARK-21762) FileFormatWriter metrics collection fails if a newly close()d file isn't yet visible
Steve Loughran created SPARK-21762: -- Summary: FileFormatWriter metrics collection fails if a newly close()d file isn't yet visible Key: SPARK-21762 URL: https://issues.apache.org/jira/browse/SPARK-21762 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: object stores without complete creation consistency (this includes AWS S3's caching of negative GET results) Reporter: Steve Loughran Priority: Minor The metrics collection of SPARK-20703 can trigger premature failure if the newly written object isn't actually visible yet, that is if, after {{writer.close()}}, a {{getFileStatus(path)}} returns a {{FileNotFoundException}}. Strictly speaking, not having a file immediately visible goes against the fundamental expectations of the Hadoop FS APIs, namely full consistent data & medata across all operations, with immediate global visibility of all changes. However, not all object stores make that guarantee, be it only newly created data or updated blobs. And so spurious FNFEs can get raised, ones which *should* have gone away by the time the actual task is committed. Or if they haven't, the job is in such deep trouble. What to do? # leave as is: fail fast & so catch blobstores/blobstore clients which don't behave as required. One issue here: will that trigger retries, what happens there, etc, etc. # Swallow the FNFE and hope the file is observable later. # Swallow all IOEs and hope that whatever problem the FS has is transient. Options 2 & 3 aren't going to collect metrics in the event of a FNFE, or at least, not the counter of bytes written. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525 ] Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:15 PM: -- [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. {code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. I searched the places where this variable is utilized so nothing related to SparkConf unless somehow you use spark submit (pyspark calls that btw). I will try installing pyspark with pip but not sure if it will make any difference. was (Author: skonto): [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. 
{code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. I searched the places where this variable is utilized s
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525 ] Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:12 PM: -- [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. {code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. I searched the places where this variable is utilized so nothing related to SparkConf unless somehow you use spark submit (pyspark calls that btw). I will try installing pyspark with pip but not sure if it will make any difference. was (Author: skonto): [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. 
{code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. I searched the places where this variable is utilized so
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525 ] Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:11 PM: -- [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. {code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. I searched the places where this variable is utilized so nothing related to SparkConf unless somehow you use spark submit (pyspark calls that btw). was (Author: skonto): [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. 
{code:java}
import pyspark
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()

people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"])
people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
{code}
Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for some ways to start things. Now, I suspect this is the responsible line, https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54, where PYSPARK_SUBMIT_ARGS is taken into consideration; but, as I said, from what I observed the Java gateway is used only once, when my Python notebook is started. You can easily check that by modifying the file to print something, and also by checking whether you already have spark defined, as in my case.

> Config spark.jars.packages is ignored in SparkSession config > > >
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525 ] Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:02 PM: -- [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar to (likely). Then I run your example but the --packages has no effect. {code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. was (Author: skonto): [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar with. Then I run your example but the --packages has no effect. 
{code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. > Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug >
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525 ] Stavros Kontopoulos commented on SPARK-21752: - [~jsnowacki] I dont think I am doing anything wrong. I followed your instructions. I use pyspark which comes with the spark distro no need to install it on my system. So when I do: export PYSPARK_DRIVER_PYTHON_OPTS='notebook' export PYSPARK_DRIVER_PYTHON=jupyter and then ./pyspark I have a fully working jupyter notebook. Also by typing in a cell spark, a spark session is already defined and there is also sc defined. SparkSession - in-memory SparkContext Spark UI Version v2.3.0-SNAPSHOT Master local[*] AppName PySparkShell So its not the case that you need to setup spark session on your own unless things are setup in some other way I am not familiar with. Then I run your example but the --packages has no effect. {code:java} import pyspark import os os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' conf = pyspark.SparkConf() conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") spark = pyspark.sql.SparkSession.builder\ .appName('test-mongo')\ .master('local[*]')\ .config(conf=conf)\ .getOrCreate() people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"]) people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save() {code} Check here: https://github.com/jupyter/notebook/issues/743 https://gist.github.com/ololobus/4c221a0891775eaa86b0 for someways to start things. Now, I suspect this is the responsible line https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54 so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what I observed java gateway is used once when my pythonbook is started. You can easily check that by modifying the file to print something and also by checking if you have spark already defined as in my case. > Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakub Nowacki > > If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder > as follows: > {code} > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\ > .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \ > .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \ > .getOrCreate() > {code} > the SparkSession gets created but there are no package download logs printed, > and if I use the loaded classes, Mongo connector in this case, but it's the > same for other packages, I get {{java.lang.ClassNotFoundException}} for the > missing classes. 
> If I use the config file {{conf/spark-defaults.comf}}, command line option > {{--packages}}, e.g.: > {code} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages > org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' > {code} > it works fine. Interestingly, using {{SparkConf}} object works fine as well, > e.g.: > {code} > conf = pyspark.SparkConf() > conf.set("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") > conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") > conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config(conf=conf)\ > .getOrCreate() > {code} > The above is in Python but I've seen the behavior in other languages, though, > I didn't check R. > I also have seen it in older Spark versions. > It seems that this is the only config key that doesn't work for me via the > {{SparkSession}} builder config. > Note that this is related to creating new {{SparkSession}} as getting new > packages into existing {{SparkSession}} doesn't indeed make sense. Thus this > will only work with bare Python, Scala or Java, and not on {{pyspark}} or > {{spark-shell}} as they create the session automatically; it this case one > would need to use {{--packages}} option. -- This message was sent by Atlassian JIRA (v6.4.14#
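A quick way to confirm whether the requested package actually reached the driver is to list the jars the running context knows about. The following is a minimal sketch, assuming an already-created session named {{spark}} (as in the shells discussed above); it is an illustration, not code from the ticket.
{code:scala}
// Minimal check: did --packages / spark.jars.packages actually register any jars?
// Assumes an existing SparkSession named `spark` (e.g. the one pyspark creates).
val loadedJars: Seq[String] = spark.sparkContext.listJars()
loadedJars.filter(_.contains("mongo-spark-connector")).foreach(println)
// An empty result here is consistent with the ClassNotFoundException described above.
{code}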
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130494#comment-16130494 ] Dongjoon Hyun commented on SPARK-15689: --- Thank you for the document, too! > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130472#comment-16130472 ] Varene Olivier commented on SPARK-21063: Hi, I am experiencing the same issue with Spark 2.2.0 and HiveServer2 via HTTP.
{code:scala}
// transform a Scala Map[String,String] into Java Properties (needed by the JDBC driver)
implicit def map2Properties(map: Map[String, String]): java.util.Properties = {
  (new java.util.Properties /: map) { case (props, (k, v)) => props.put(k, v); props }
}

val url = "jdbc:hive2://remote.server:10001/myDatabase;transportMode=http;httpPath=cliservice"
// from "org.apache.hive" % "hive-jdbc" % "1.2.2"
val driver = "org.apache.hive.jdb.HiveServer"
val user = "myRemoteUser"
val password = "myRemotePassword"
val table = "myNonEmptyTable"
val props = Map("user" -> user, "password" -> password, "driver" -> driver)

val d = spark.read.jdbc(url, table, props)
println(d.count)
{code}
returns:
{code}
0
{code}
and my table is not empty.

> Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Peter Bykov > > Spark returns an empty result when querying a remote hadoop cluster. > All firewall settings removed. > Querying using JDBC works properly using the hive-jdbc driver from version 1.1.1 > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > returns empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130472#comment-16130472 ] Varene Olivier edited comment on SPARK-21063 at 8/17/17 2:25 PM: - Hi, I am experiencing the same issue with Spark 2.2.0 and HiveServer2 via http {code} // transform scala Map[String,String] to Java Properties (needed by jdbc driver) implicit def map2Properties(map:Map[String,String]):java.util.Properties = { (new java.util.Properties /: map) {case (props, (k,v)) => props.put(k,v); props} } val url = "jdbc:hive2://remote.server:10001/myDatabase;transportMode=http;httpPath=cliservice" // from "org.apache.hive" % "hive-jdbc" % "1.2.2" val driver = "org.apache.hive.jdb.HiveServer" val user = "myRemoteUser" val password = "myRemotePassword" val table = "myNonEmptyTable" val props = Map("user" -> user,"password" -> password, "driver" -> driver) val d = spark.read.jdbc(url,table,props) println(d.count) {code} returns : {code} 0 {code} and my table is not empty was (Author: ov): Hi, I am experiencing the same issue with Spark 2.2.0 and HiveServer2 via http {code:scala} // transform scala Map[String,String] to Java Properties (needed by jdbc driver) implicit def map2Properties(map:Map[String,String]):java.util.Properties = { (new java.util.Properties /: map) {case (props, (k,v)) => props.put(k,v); props} } val url = "jdbc:hive2://remote.server:10001/myDatabase;transportMode=http;httpPath=cliservice" // from "org.apache.hive" % "hive-jdbc" % "1.2.2" val driver = "org.apache.hive.jdb.HiveServer" val user = "myRemoteUser" val password = "myRemotePassword" val table = "myNonEmptyTable" val props = Map("user" -> user,"password" -> password, "driver" -> driver) val d = spark.read.jdbc(url,table,props) println(d.count) {code} returns : {code} 0 {code} and my table is not empty > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Peter Bykov > > Spark returning empty result from when querying remote hadoop cluster. > All firewall settings removed. > Querying using JDBC working properly using hive-jdbc driver from version 1.1.1 > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > returns empty result too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21642) Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
[ https://issues.apache.org/jira/browse/SPARK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21642: --- Assignee: Aki Tanaka (was: Hideaki Tanaka) > Use FQDN for DRIVER_HOST_ADDRESS instead of ip address > -- > > Key: SPARK-21642 > URL: https://issues.apache.org/jira/browse/SPARK-21642 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.2.0 >Reporter: Aki Tanaka >Assignee: Aki Tanaka > Fix For: 2.3.0 > > > In current implementation, ip address of a driver host is set to > DRIVER_HOST_ADDRESS [1]. This becomes a problem when we enable SSL using > "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" > properties. When we configure these properties, spark web ui is launched with > SSL enabled and the HTTPS server is configured with the custom SSL > certificate you configured in these properties. > In this case, client gets javax.net.ssl.SSLPeerUnverifiedException exception > when the client accesses the spark web ui because the client fails to verify > the SSL certificate (Common Name of the SSL cert does not match with > DRIVER_HOST_ADDRESS). > To avoid the exception, we should use FQDN of the driver host for > DRIVER_HOST_ADDRESS. > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L222 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L942 > Error message that client gets when the client accesses spark web ui: > javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> > doesn't match any of the subject alternative names: [] > {code:java} > $ spark-submit /path/to/jar > .. > 17/08/04 14:48:07 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 17/08/04 14:48:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://10.43.3.8:4040 > $ curl -I http://10.43.3.8:4040 > HTTP/1.1 302 Found > Date: Fri, 04 Aug 2017 14:48:20 GMT > Location: https://10.43.3.8:4440/ > Content-Length: 0 > Server: Jetty(9.2.z-SNAPSHOT) > $ curl -v https://10.43.3.8:4440 > * Rebuilt URL to: https://10.43.3.8:4440/ > * Trying 10.43.3.8... > * TCP_NODELAY set > * Connected to 10.43.3.8 (10.43.3.8) port 4440 (#0) > * Initializing NSS with certpath: sql:/etc/pki/nssdb > * CAfile: /etc/pki/tls/certs/ca-bundle.crt > CApath: none > * Server certificate: > * subject: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US > * start date: Jun 12 00:05:02 2017 GMT > * expire date: Jun 12 00:05:02 2018 GMT > * common name: *.example.com > * issuer: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21642) Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
[ https://issues.apache.org/jira/browse/SPARK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21642: --- Assignee: Hideaki Tanaka > Use FQDN for DRIVER_HOST_ADDRESS instead of ip address > -- > > Key: SPARK-21642 > URL: https://issues.apache.org/jira/browse/SPARK-21642 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.2.0 >Reporter: Aki Tanaka >Assignee: Hideaki Tanaka > Fix For: 2.3.0 > > > In current implementation, ip address of a driver host is set to > DRIVER_HOST_ADDRESS [1]. This becomes a problem when we enable SSL using > "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" > properties. When we configure these properties, spark web ui is launched with > SSL enabled and the HTTPS server is configured with the custom SSL > certificate you configured in these properties. > In this case, client gets javax.net.ssl.SSLPeerUnverifiedException exception > when the client accesses the spark web ui because the client fails to verify > the SSL certificate (Common Name of the SSL cert does not match with > DRIVER_HOST_ADDRESS). > To avoid the exception, we should use FQDN of the driver host for > DRIVER_HOST_ADDRESS. > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L222 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L942 > Error message that client gets when the client accesses spark web ui: > javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> > doesn't match any of the subject alternative names: [] > {code:java} > $ spark-submit /path/to/jar > .. > 17/08/04 14:48:07 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 17/08/04 14:48:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://10.43.3.8:4040 > $ curl -I http://10.43.3.8:4040 > HTTP/1.1 302 Found > Date: Fri, 04 Aug 2017 14:48:20 GMT > Location: https://10.43.3.8:4440/ > Content-Length: 0 > Server: Jetty(9.2.z-SNAPSHOT) > $ curl -v https://10.43.3.8:4440 > * Rebuilt URL to: https://10.43.3.8:4440/ > * Trying 10.43.3.8... > * TCP_NODELAY set > * Connected to 10.43.3.8 (10.43.3.8) port 4440 (#0) > * Initializing NSS with certpath: sql:/etc/pki/nssdb > * CAfile: /etc/pki/tls/certs/ca-bundle.crt > CApath: none > * Server certificate: > * subject: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US > * start date: Jun 12 00:05:02 2017 GMT > * expire date: Jun 12 00:05:02 2018 GMT > * common name: *.example.com > * issuer: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21642) Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
[ https://issues.apache.org/jira/browse/SPARK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21642. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18846 [https://github.com/apache/spark/pull/18846] > Use FQDN for DRIVER_HOST_ADDRESS instead of ip address > -- > > Key: SPARK-21642 > URL: https://issues.apache.org/jira/browse/SPARK-21642 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.2.0 >Reporter: Aki Tanaka > Fix For: 2.3.0 > > > In current implementation, ip address of a driver host is set to > DRIVER_HOST_ADDRESS [1]. This becomes a problem when we enable SSL using > "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" > properties. When we configure these properties, spark web ui is launched with > SSL enabled and the HTTPS server is configured with the custom SSL > certificate you configured in these properties. > In this case, client gets javax.net.ssl.SSLPeerUnverifiedException exception > when the client accesses the spark web ui because the client fails to verify > the SSL certificate (Common Name of the SSL cert does not match with > DRIVER_HOST_ADDRESS). > To avoid the exception, we should use FQDN of the driver host for > DRIVER_HOST_ADDRESS. > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L222 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L942 > Error message that client gets when the client accesses spark web ui: > javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> > doesn't match any of the subject alternative names: [] > {code:java} > $ spark-submit /path/to/jar > .. > 17/08/04 14:48:07 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 17/08/04 14:48:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://10.43.3.8:4040 > $ curl -I http://10.43.3.8:4040 > HTTP/1.1 302 Found > Date: Fri, 04 Aug 2017 14:48:20 GMT > Location: https://10.43.3.8:4440/ > Content-Length: 0 > Server: Jetty(9.2.z-SNAPSHOT) > $ curl -v https://10.43.3.8:4440 > * Rebuilt URL to: https://10.43.3.8:4440/ > * Trying 10.43.3.8... > * TCP_NODELAY set > * Connected to 10.43.3.8 (10.43.3.8) port 4440 (#0) > * Initializing NSS with certpath: sql:/etc/pki/nssdb > * CAfile: /etc/pki/tls/certs/ca-bundle.crt > CApath: none > * Server certificate: > * subject: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US > * start date: Jun 12 00:05:02 2017 GMT > * expire date: Jun 12 00:05:02 2018 GMT > * common name: *.example.com > * issuer: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
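The fix above boils down to preferring the driver host's canonical name over its raw IP so that it can match the certificate's common name or subject alternative names. Below is a minimal standalone sketch of that idea using the JDK API; it is an illustration of the approach, not the actual change in Spark's Utils code.
{code:scala}
import java.net.InetAddress

// Illustration only (not the actual Spark code): prefer the local host's FQDN
// over its raw IP address so it can match the CN/SAN of the SSL certificate.
object DriverHostName {
  def fqdnOrIp(): String = {
    val addr = InetAddress.getLocalHost
    val fqdn = addr.getCanonicalHostName       // e.g. "driver-1.example.com"
    if (fqdn != null && fqdn.nonEmpty) fqdn
    else addr.getHostAddress                   // fall back to the dotted IP string
  }
}

// println(DriverHostName.fqdnOrIp())
{code}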
[jira] [Commented] (SPARK-21743) top-most limit should not cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130387#comment-16130387 ] Wenchen Fan commented on SPARK-21743: - issue resolved by https://github.com/apache/spark/pull/18955 > top-most limit should not cause memory leak > --- > > Key: SPARK-21743 > URL: https://issues.apache.org/jira/browse/SPARK-21743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130355#comment-16130355 ] Russell Spitzer commented on SPARK-15689: - Thanks [~cloud_fan] for posting the design doc it was a great read and I like a lot of the direction this is going in. It would helpful if we could have access to the doc as a google doc or some other editable/comment-able form though to encourage discussion. I left some comments on the prototype but one thing I think could be a great addition would be a joinInterface. I ended up writing up one of these specifically for Cassandra and had to do a lot of plumbing to get it to fit into the rest of the Catalyst ecosystem so I think this would be a great time to plan ahead in Spark design. The join interface would look a lot like a combination of the read and write apis, given a row input and a set of expressions the relationship should return rows that match those expressions OR fallback to just being a read relationship if none of the expressions can be satisfied by the join (leaving all the expressions to be evaluated in spark). > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
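Purely to illustrate the kind of contract the comment above is asking for, a data-source join interface might be sketched as below. None of these names exist in Spark; it is a hypothetical shape, with a fallback to a plain scan when the source cannot handle any of the join expressions.
{code:scala}
// Hypothetical sketch of a data-source "join interface"; none of these names
// are real Spark APIs. Row and Expr stand in for whatever row and expression
// representations the v2 API ends up exposing.
trait JoinableRelation[Row, Expr] {
  // The subset of join expressions this source can evaluate on its own side.
  def handledExpressions(exprs: Seq[Expr]): Seq[Expr]

  // Given driver-provided rows and the handled expressions, return matching
  // rows, or None to signal "fall back to a full read plus a Spark-side join".
  def joinWith(left: Iterator[Row], exprs: Seq[Expr]): Option[Iterator[Row]]
}
{code}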
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15689: Attachment: SPIP Data Source API V2.pdf > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21761) [Core] Add the application's final state for SparkListenerApplicationEnd event
lishuming created SPARK-21761: - Summary: [Core] Add the application's final state for SparkListenerApplicationEnd event Key: SPARK-21761 URL: https://issues.apache.org/jira/browse/SPARK-21761 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.1.1 Reporter: lishuming Priority: Minor When adding an extra `SparkListener`, we want to get the application's final state, which I think is necessary to record. Maybe we can change `SparkListenerApplicationEnd` as below:
{code:java}
// current event
case class SparkListenerApplicationEnd(time: Long, sparkUser: String) extends SparkListenerEvent

// proposed event carrying the final state
import org.apache.spark.launcher.SparkAppHandle.State
case class SparkListenerApplicationEnd(time: Long, sparkUser: String, status: State) extends SparkListenerEvent
{code}
Of course, we would need to add implementations of this change for the different deploy modes. Can someone give me some advice? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
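For context, {{SparkListenerApplicationEnd}} does not currently carry the application's final state, so a listener can only record when the application ended. The following is a minimal sketch of the kind of listener the reporter wants to write, using only what is available today; the proposed {{status}} field does not exist yet.
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Minimal sketch: a listener that records when the application ends.
// The final state requested in this ticket is not on the event today and
// would have to come from somewhere else (e.g. the cluster manager).
class AppEndRecorder extends SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    println(s"Application ended at ${applicationEnd.time}")
  }
}

// Registered with: sc.addSparkListener(new AppEndRecorder)
{code}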
[jira] [Commented] (SPARK-21758) `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*
[ https://issues.apache.org/jira/browse/SPARK-21758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130288#comment-16130288 ] Feng Zhu commented on SPARK-21758: -- I can't reproduce this issue in 2.1 and master branch, could you provide more details? > `SHOW TBLPROPERTIES` can not get properties start with spark.sql.* > -- > > Key: SPARK-21758 > URL: https://issues.apache.org/jira/browse/SPARK-21758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: StanZhai >Priority: Critical > > SQL: SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts") > Exception: Table test_db.test.tb does not have property: > spark.sql.sources.schema.numParts > The `spark.sql.sources.schema.numParts` property exactly exists in > HiveMetastore. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
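For anyone trying to reproduce this, a minimal attempt could look like the sketch below (run with Hive support enabled). Note that {{spark.sql.sources.schema.numParts}} is an internal property Spark writes to the Hive metastore for data source tables, so a freshly created toy table may not hit the exact failure from the report.
{code:scala}
// Minimal repro sketch for the report above; table and property names follow the ticket.
spark.sql("CREATE TABLE IF NOT EXISTS test_tb (id INT) USING parquet")
spark.sql("SHOW TBLPROPERTIES test_tb").show(truncate = false)
// The reported failure is on the keyed lookup form:
spark.sql("""SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts")""").show(truncate = false)
{code}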
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130282#comment-16130282 ] Jakub Nowacki commented on SPARK-21752: --- [~skonto] Well, I'm not sure where you're failing here. If you want to get PySpark installed with a vanilla Python distribution, you can do {{pip install pyspark}} or {{conda install -c conda-forge pyspark}}. Other than that, the above scripts are complete, bar the {{import pyspark}}, as I mentioned before. Below I give a slightly more complete example with the env variable:
{code}
import os
import pyspark

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
{code}
and with the {{SparkConf}} approach:
{code}
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
{code}
and with the plain {{SparkSession}} config:
{code}
import pyspark

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
{code}
In my case the first two work as expected, and the last one fails with {{ClassNotFoundException}}. 
> Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakub Nowacki > > If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder > as follows: > {code} > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\ > .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \ > .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \ > .getOrCreate() > {code} > the SparkSession gets created but there are no package download logs printed, > and if I use the loaded classes, Mongo connector in this case, but it's the > same for other packages, I get {{java.lang.ClassNotFoundException}} for the > missing classes. > If I use the config file {{conf/spark-defaults.comf}}, command line option > {{--packages}}, e.g.: > {code} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages > org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' > {code} > it works fine. Interestingly, using {{SparkConf}} object works fine as well, > e.g.: > {code} > conf = pyspark.SparkConf() > conf.set("spark.jars.packages", > "org.mongodb.spark:mongo-
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130274#comment-16130274 ] Jakub Nowacki commented on SPARK-21752: --- OK I get the point. I think we should only consider this in an interactive, notebook based environment. I don't use the master for sure in the {{spark-submit}} executioner, but also using packages internally should be discouraged. I think it should be a bit more clear in documentation what can and what cannot be used. Also, interactive environment like Jupyter or similar should be made as an exception, or more clear description for setup should be provided. Also, especially with using the above setting with packages, there is no warning provided that this option is really ignored, thus, maybe one should be added similar to the one with reusing existing SparkSession, i.e. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L896 > Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakub Nowacki > > If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder > as follows: > {code} > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\ > .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \ > .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \ > .getOrCreate() > {code} > the SparkSession gets created but there are no package download logs printed, > and if I use the loaded classes, Mongo connector in this case, but it's the > same for other packages, I get {{java.lang.ClassNotFoundException}} for the > missing classes. > If I use the config file {{conf/spark-defaults.comf}}, command line option > {{--packages}}, e.g.: > {code} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages > org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' > {code} > it works fine. Interestingly, using {{SparkConf}} object works fine as well, > e.g.: > {code} > conf = pyspark.SparkConf() > conf.set("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") > conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll") > conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll") > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config(conf=conf)\ > .getOrCreate() > {code} > The above is in Python but I've seen the behavior in other languages, though, > I didn't check R. > I also have seen it in older Spark versions. > It seems that this is the only config key that doesn't work for me via the > {{SparkSession}} builder config. > Note that this is related to creating new {{SparkSession}} as getting new > packages into existing {{SparkSession}} doesn't indeed make sense. Thus this > will only work with bare Python, Scala or Java, and not on {{pyspark}} or > {{spark-shell}} as they create the session automatically; it this case one > would need to use {{--packages}} option. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
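As an illustration of the warning suggested in the comment above, the check could look roughly like the sketch below. It is purely hypothetical: the names do not exist in Spark, and the real change would live inside the SparkSession builder rather than a standalone object.
{code:scala}
// Hypothetical sketch of the suggested warning; names are illustrative only.
// Idea: flag builder options that only take effect when the driver JVM is
// launched (spark-submit / spark-defaults.conf), such as spark.jars.packages.
object LaunchOnlyConfigCheck {
  private val launchOnlyKeys = Set("spark.jars.packages", "spark.driver.extraClassPath")

  def warnIfIgnored(builderOptions: Map[String, String]): Unit =
    builderOptions.keySet.intersect(launchOnlyKeys).foreach { key =>
      Console.err.println(
        s"WARN: '$key' was set on SparkSession.builder after the driver JVM started; " +
          "it will be ignored. Use --packages or spark-defaults.conf instead.")
    }
}

// LaunchOnlyConfigCheck.warnIfIgnored(
//   Map("spark.jars.packages" -> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))
{code}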