[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)

2018-11-03 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674216#comment-16674216
 ] 

Felix Cheung commented on SPARK-25923:
--

thanks - what's the exchange required with the CRAN admin?

> SparkR UT Failure (checking CRAN incoming feasibility)
> --
>
> Key: SPARK-25923
> URL: https://issues.apache.org/jira/browse/SPARK-25923
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
>
> Currently, the following SparkR error blocks PR builders.
> {code:java}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 26] do not match the length of object [0]
> Execution halted
> {code}
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674191#comment-16674191
 ] 

Apache Spark commented on SPARK-25934:
--

User 'mpmolek' has created a pull request for this issue:
https://github.com/apache/spark/pull/22937

> Mesos: SPARK_CONF_DIR should not be propogated by spark submit
> --
>
> Key: SPARK-25934
> URL: https://issues.apache.org/jira/browse/SPARK-25934
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.2
>Reporter: Matt Molek
>Priority: Major
>
> This is very similar to how SPARK_HOME caused problems for Spark on Mesos in 
> SPARK-12345.
> The `spark-submit` command sets spark.mesos.driverEnv.SPARK_CONF_DIR to 
> whatever SPARK_CONF_DIR was for the command that submitted the job.
> This doesn't make sense for most Mesos setups, and it broke Spark for 
> my team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
> 2.4.0 will have the same issue.
> It's preventing spark-env.sh from running because SPARK_CONF_DIR now points 
> to a non-existent directory instead of the unpacked Spark binary in the 
> Mesos sandbox, as it should.
> I'm not that familiar with the Spark code base, but I think this could be 
> fixed by simply adding a `&& k != "SPARK_CONF_DIR"` clause to this filter 
> statement: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25934:


Assignee: (was: Apache Spark)

> Mesos: SPARK_CONF_DIR should not be propogated by spark submit
> --
>
> Key: SPARK-25934
> URL: https://issues.apache.org/jira/browse/SPARK-25934
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.2
>Reporter: Matt Molek
>Priority: Major
>
> This is very similar to how SPARK_HOME caused problems for Spark on Mesos in 
> SPARK-12345.
> The `spark-submit` command sets spark.mesos.driverEnv.SPARK_CONF_DIR to 
> whatever SPARK_CONF_DIR was for the command that submitted the job.
> This doesn't make sense for most Mesos setups, and it broke Spark for 
> my team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
> 2.4.0 will have the same issue.
> It's preventing spark-env.sh from running because SPARK_CONF_DIR now points 
> to a non-existent directory instead of the unpacked Spark binary in the 
> Mesos sandbox, as it should.
> I'm not that familiar with the Spark code base, but I think this could be 
> fixed by simply adding a `&& k != "SPARK_CONF_DIR"` clause to this filter 
> statement: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25934:


Assignee: Apache Spark

> Mesos: SPARK_CONF_DIR should not be propogated by spark submit
> --
>
> Key: SPARK-25934
> URL: https://issues.apache.org/jira/browse/SPARK-25934
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.2
>Reporter: Matt Molek
>Assignee: Apache Spark
>Priority: Major
>
> This is very similar to how SPARK_HOME caused problems for Spark on Mesos in 
> SPARK-12345.
> The `spark-submit` command sets spark.mesos.driverEnv.SPARK_CONF_DIR to 
> whatever SPARK_CONF_DIR was for the command that submitted the job.
> This doesn't make sense for most Mesos setups, and it broke Spark for 
> my team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
> 2.4.0 will have the same issue.
> It's preventing spark-env.sh from running because SPARK_CONF_DIR now points 
> to a non-existent directory instead of the unpacked Spark binary in the 
> Mesos sandbox, as it should.
> I'm not that familiar with the Spark code base, but I think this could be 
> fixed by simply adding a `&& k != "SPARK_CONF_DIR"` clause to this filter 
> statement: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674189#comment-16674189
 ] 

Apache Spark commented on SPARK-25934:
--

User 'mpmolek' has created a pull request for this issue:
https://github.com/apache/spark/pull/22937

> Mesos: SPARK_CONF_DIR should not be propogated by spark submit
> --
>
> Key: SPARK-25934
> URL: https://issues.apache.org/jira/browse/SPARK-25934
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.2
>Reporter: Matt Molek
>Priority: Major
>
> This is very similar to how SPARK_HOME caused problems for Spark on Mesos in 
> SPARK-12345.
> The `spark-submit` command sets spark.mesos.driverEnv.SPARK_CONF_DIR to 
> whatever SPARK_CONF_DIR was for the command that submitted the job.
> This doesn't make sense for most Mesos setups, and it broke Spark for 
> my team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
> 2.4.0 will have the same issue.
> It's preventing spark-env.sh from running because SPARK_CONF_DIR now points 
> to a non-existent directory instead of the unpacked Spark binary in the 
> Mesos sandbox, as it should.
> I'm not that familiar with the Spark code base, but I think this could be 
> fixed by simply adding a `&& k != "SPARK_CONF_DIR"` clause to this filter 
> statement: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-03 Thread Matt Molek (JIRA)
Matt Molek created SPARK-25934:
--

 Summary: Mesos: SPARK_CONF_DIR should not be propogated by spark 
submit
 Key: SPARK-25934
 URL: https://issues.apache.org/jira/browse/SPARK-25934
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.3.2
Reporter: Matt Molek


This is very similar to how SPARK_HOME caused problems for Spark on Mesos in 
SPARK-12345.

The `spark-submit` command sets spark.mesos.driverEnv.SPARK_CONF_DIR to 
whatever SPARK_CONF_DIR was for the command that submitted the job.

This doesn't make sense for most Mesos setups, and it broke Spark for my 
team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
2.4.0 will have the same issue.

It's preventing spark-env.sh from running because SPARK_CONF_DIR now points to 
a non-existent directory instead of the unpacked Spark binary in the Mesos 
sandbox, as it should.

I'm not that familiar with the Spark code base, but I think this could be fixed 
by simply adding a `&& k != "SPARK_CONF_DIR"` clause to this filter statement: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421
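For illustration only, a minimal sketch of what the proposed change might look like, assuming a 
filter that forwards SPARK_*/MESOS_* variables to the driver environment; the method name 
filterLaunchEnvironment is made up here, and the real predicate at the linked line may differ:

{code:scala}
// Hypothetical model of the environment filter in RestSubmissionClient; only the
// `k != "SPARK_CONF_DIR"` clause is the addition being proposed in this issue.
def filterLaunchEnvironment(env: Map[String, String]): Map[String, String] = {
  env.filter { case (k, _) =>
    (k.startsWith("SPARK_") || k.startsWith("MESOS_")) &&
      k != "SPARK_HOME" &&        // assumed to be excluded already (cf. SPARK-12345)
      k != "SPARK_CONF_DIR"       // proposed: do not propagate the submitter's conf dir
  }
}
{code}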



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25931:
--
Component/s: Tests

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25933.
---
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0
   2.3.3

Issue resolved by pull request 22933
[https://github.com/apache/spark/pull/22933]

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Assignee: Alex Hagerman
>Priority: Trivial
>  Labels: documentation
> Fix For: 2.3.3, 3.0.0, 2.4.1
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25933:
-

Assignee: Alex Hagerman

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Assignee: Alex Hagerman
>Priority: Trivial
>  Labels: documentation
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25933:
--
Target Version/s:   (was: 2.3.2)
          Labels: documentation  (was: documentation pull-request-available)
   Fix Version/s: (was: 2.3.2)

You don't need a Jira for something this trivial; also don't set target/fix 
version

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25890) Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.

2018-11-03 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674112#comment-16674112
 ] 

Maxim Gekk commented on SPARK-25890:


I haven't reproduced the issue on the master branch.

> Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.
> -
>
> Key: SPARK-25890
> URL: https://issues.apache.org/jira/browse/SPARK-25890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.3.2
>Reporter: Lakshminarayan Kamath
>Priority: Major
>
> Reading a Ctrl-A delimited CSV file ignores rows with all null values. 
> However, a comma-delimited CSV file doesn't.
> *Reproduction in spark-shell:*
> import org.apache.spark.sql._
>  import org.apache.spark.sql.types._
> val l = List(List(1, 2), List(null,null), List(2,3))
>  val datasetSchema = StructType(List(StructField("colA", IntegerType, true), 
> StructField("colB", IntegerType, true)))
>  val rdd = sc.parallelize(l).map(item ⇒ Row.fromSeq(item.toSeq))
>  val df = spark.createDataFrame(rdd, datasetSchema)
> df.show()
> |colA|colB|
> |1   |2   |
> |null|null|
> |2   |3   |
> df.write.option("delimiter", "\u0001").option("header", 
> "true").csv("/ctrl-a-separated.csv")
>  df.write.option("delimiter", ",").option("header", 
> "true").csv("/comma-separated.csv")
> val commaDf = spark.read.option("header", "true").option("delimiter", 
> ",").csv("/comma-separated.csv")
>  commaDf.show
> |colA|colB|
> |1   |2   |
> |2   |3   |
> |null|null|
> val ctrlaDf = spark.read.option("header", "true").option("delimiter", 
> "\u0001").csv("/ctrl-a-separated.csv")
>  ctrlaDf.show
> |colA|colB|
> |1   |2   |
> |2   |3   |
>  
> As seen above, for Ctrl-A delimited CSV, rows containing only null values are 
> ignored.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19799) Support WITH clause in subqueries

2018-11-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674095#comment-16674095
 ] 

Apache Spark commented on SPARK-19799:
--

User 'gbloisi' has created a pull request for this issue:
https://github.com/apache/spark/pull/22936

> Support WITH clause in subqueries
> -
>
> Key: SPARK-19799
> URL: https://issues.apache.org/jira/browse/SPARK-19799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Major
>
> Because of SPARK-17590 it should be relatively easy to support the WITH clause in 
> subqueries besides nested CTE definitions.
> Here is an example of a query that does not run on Spark:
> create table test (seqno int, k string, v int) using parquet;
> insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 
> 66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33);
> SELECT percentile(b, 0.5) FROM (WITH mavg AS (SELECT k, AVG(v) OVER 
> (PARTITION BY k ORDER BY seqno ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b 
> FROM test ORDER BY seqno) SELECT k, MAX(b) as b  FROM mavg GROUP BY k);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19799) Support WITH clause in subqueries

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19799:


Assignee: (was: Apache Spark)

> Support WITH clause in subqueries
> -
>
> Key: SPARK-19799
> URL: https://issues.apache.org/jira/browse/SPARK-19799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Major
>
> Because of SPARK-17590 it should be relatively easy to support the WITH clause in 
> subqueries besides nested CTE definitions.
> Here is an example of a query that does not run on Spark:
> create table test (seqno int, k string, v int) using parquet;
> insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 
> 66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33);
> SELECT percentile(b, 0.5) FROM (WITH mavg AS (SELECT k, AVG(v) OVER 
> (PARTITION BY k ORDER BY seqno ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b 
> FROM test ORDER BY seqno) SELECT k, MAX(b) as b  FROM mavg GROUP BY k);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19799) Support WITH clause in subqueries

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19799:


Assignee: Apache Spark

> Support WITH clause in subqueries
> -
>
> Key: SPARK-19799
> URL: https://issues.apache.org/jira/browse/SPARK-19799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Assignee: Apache Spark
>Priority: Major
>
> Because of SPARK-17590 it should be relatively easy to support the WITH clause in 
> subqueries besides nested CTE definitions.
> Here is an example of a query that does not run on Spark:
> create table test (seqno int, k string, v int) using parquet;
> insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 
> 66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33);
> SELECT percentile(b, 0.5) FROM (WITH mavg AS (SELECT k, AVG(v) OVER 
> (PARTITION BY k ORDER BY seqno ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b 
> FROM test ORDER BY seqno) SELECT k, MAX(b) as b  FROM mavg GROUP BY k);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25931:
-

Assignee: Maxim Gekk

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25931.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22920
[https://github.com/apache/spark/pull/22920]

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25931:
--
Affects Version/s: (was: 2.4.0)
                   3.0.0

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-11-03 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674066#comment-16674066
 ] 

antonkulaga commented on SPARK-25588:
-

Any updates on this? This bug blocks the ADAM library and hence blocks most 
bioinformaticians using Spark.

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> required binary value (UTF8);
>   }
> }

[jira] [Commented] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674044#comment-16674044
 ] 

Apache Spark commented on SPARK-25933:
--

User 'AlexHagerman' has created a pull request for this issue:
https://github.com/apache/spark/pull/22933

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Priority: Trivial
>  Labels: documentation, pull-request-available
> Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25933:


Assignee: (was: Apache Spark)

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Priority: Trivial
>  Labels: documentation, pull-request-available
> Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25933:


Assignee: Apache Spark

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: documentation, pull-request-available
> Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Alex Hagerman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674043#comment-16674043
 ] 

Alex Hagerman commented on SPARK-25933:
---

https://github.com/apache/spark/pull/22933

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Priority: Trivial
>  Labels: documentation, pull-request-available
> Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Alex Hagerman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Hagerman updated SPARK-25933:
--
Labels: documentation pull-request-available  (was: documentation)

> Fix pstats reference for spark.python.profile.dump in configuration.md
> --
>
> Key: SPARK-25933
> URL: https://issues.apache.org/jira/browse/SPARK-25933
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.2
>Reporter: Alex Hagerman
>Priority: Trivial
>  Labels: documentation, pull-request-available
> Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in 
> https://spark.apache.org/docs/latest/configuration.html for 
> spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md

2018-11-03 Thread Alex Hagerman (JIRA)
Alex Hagerman created SPARK-25933:
-

 Summary: Fix pstats reference for spark.python.profile.dump in 
configuration.md
 Key: SPARK-25933
 URL: https://issues.apache.org/jira/browse/SPARK-25933
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.3.2
Reporter: Alex Hagerman
 Fix For: 2.3.2


ptats.Stats() should be pstats.Stats() in 
https://spark.apache.org/docs/latest/configuration.html for 
spark.python.profile.dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25906) spark-shell cannot handle `-i` option correctly

2018-11-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25906:
-
Target Version/s: 2.4.1, 3.0.0  (was: 2.4.1)

> spark-shell cannot handle `-i` option correctly
> ---
>
> Key: SPARK-25906
> URL: https://issues.apache.org/jira/browse/SPARK-25906
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is a regression on Spark 2.4.0.
> *Spark 2.3.2*
> {code:java}
> $ cat test.scala
> spark.version
> case class Record(key: Int, value: String)
> spark.sparkContext.parallelize((1 to 2).map(i => Record(i, 
> s"val_$i"))).toDF.show
> $ bin/spark-shell -i test.scala
> 18/10/31 23:22:43 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1541053368478).
> Spark session available as 'spark'.
> Loading test.scala...
> res0: String = 2.3.2
> defined class Record
> 18/10/31 23:22:56 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> +---+-+
> |key|value|
> +---+-+
> |  1|val_1|
> |  2|val_2|
> +---+-+
> {code}
> *Spark 2.4.0 RC5*
> {code:java}
> $ bin/spark-shell -i test.scala
> 2018-10-31 23:23:14 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1541053400312).
> Spark session available as 'spark'.
> test.scala:17: error: value toDF is not a member of 
> org.apache.spark.rdd.RDD[Record]
> Error occurred in an application involving default arguments.
>spark.sparkContext.parallelize((1 to 2).map(i => Record(i, 
> s"val_$i"))).toDF.show
> {code}
> *WORKAROUND*
> Add the following line at the beginning of the script.
> {code}
> import spark.implicits._
> {code}
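As a concrete illustration, the test.scala from the reproduction above with the workaround 
applied (the implicits import added as the first line) is:

{code:scala}
// Workaround for the 2.4.0 regression: bring the toDF conversions into scope
// explicitly, since the script run via -i no longer picks them up automatically.
import spark.implicits._

spark.version
case class Record(key: Int, value: String)
spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
{code}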



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25906) spark-shell cannot handle `-i` option correctly

2018-11-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25906:
-
Target Version/s: 2.4.1

> spark-shell cannot handle `-i` option correctly
> ---
>
> Key: SPARK-25906
> URL: https://issues.apache.org/jira/browse/SPARK-25906
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is a regression on Spark 2.4.0.
> *Spark 2.3.2*
> {code:java}
> $ cat test.scala
> spark.version
> case class Record(key: Int, value: String)
> spark.sparkContext.parallelize((1 to 2).map(i => Record(i, 
> s"val_$i"))).toDF.show
> $ bin/spark-shell -i test.scala
> 18/10/31 23:22:43 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1541053368478).
> Spark session available as 'spark'.
> Loading test.scala...
> res0: String = 2.3.2
> defined class Record
> 18/10/31 23:22:56 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> +---+-+
> |key|value|
> +---+-+
> |  1|val_1|
> |  2|val_2|
> +---+-+
> {code}
> *Spark 2.4.0 RC5*
> {code:java}
> $ bin/spark-shell -i test.scala
> 2018-10-31 23:23:14 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1541053400312).
> Spark session available as 'spark'.
> test.scala:17: error: value toDF is not a member of 
> org.apache.spark.rdd.RDD[Record]
> Error occurred in an application involving default arguments.
>spark.sparkContext.parallelize((1 to 2).map(i => Record(i, 
> s"val_$i"))).toDF.show
> {code}
> *WORKAROUND*
> Add the following line at the beginning of the script.
> {code}
> import spark.implicits._
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25932) why GraphX vertexId should be Long ?

2018-11-03 Thread Ali Zadedehbalaei (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Zadedehbalaei updated SPARK-25932:
--
Target Version/s:   (was: 2.3.2)
   Fix Version/s: (was: 2.3.2)

> why GraphX vertexId should be Long ?
> 
>
> Key: SPARK-25932
> URL: https://issues.apache.org/jira/browse/SPARK-25932
> Project: Spark
>  Issue Type: Question
>  Components: GraphX
>Affects Versions: 2.3.2
>Reporter: Ali Zadedehbalaei
>Priority: Critical
>
> Hi,
> Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be 
> able to use {{UUID}} as the vertex ID type because the data I want to process 
> with GraphX uses that type for its primary keys.
> Given that the UUID is a unique identifier and its length is constant, why 
> can't it be used?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25932) why GraphX vertexId should be Long ?

2018-11-03 Thread Ali Zadedehbalaei (JIRA)
Ali Zadedehbalaei created SPARK-25932:
-

 Summary: why GraphX vertexId should be Long ?
 Key: SPARK-25932
 URL: https://issues.apache.org/jira/browse/SPARK-25932
 Project: Spark
  Issue Type: Question
  Components: GraphX
Affects Versions: 2.3.2
Reporter: Ali Zadedehbalaei
 Fix For: 2.3.2


Hi,
Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be able 
to use {{UUID}} as the vertex ID type because the data I want to process with 
GraphX uses that type for its primary keys.
Given that the UUID is a unique identifier and its length is constant, why can't 
it be used?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources

2018-11-03 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674007#comment-16674007
 ] 

Maxim Gekk commented on SPARK-17967:


What about preserving the existing API as is and passing multiple values as a CSV 
string? For example:
{code:scala}
spark.read.format("csv")
  .option("nullValue", "2012, Tesla, null"))
  ...
{code}
or
{code:sql}
CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (sep '|,-', ...)
{code}

> Support for list or other types as an option for datasources
> 
>
> Key: SPARK-17967
> URL: https://issues.apache.org/jira/browse/SPARK-17967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This was discussed in SPARK-17878
> For other datasources, a string/long/boolean/double value seems okay as 
> an option, but it is not enough for a datasource such as CSV. As it 
> is an interface for other external datasources, I guess it'd affect several 
> ones out there.
> I took a first look, but it seems it'd be difficult to support this (it would 
> need a lot of changes).
> One suggestion is to support this as a JSON array.
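Purely as a hypothetical sketch of that last suggestion (no such list-valued option exists 
today, and the file path below is a placeholder), an option carrying a JSON array could look 
like:

{code:scala}
// Hypothetical: the option value stays a plain string at the API level, but the
// CSV datasource would parse it as a JSON array when it needs multiple values.
spark.read.format("csv")
  .option("nullValue", """["2012", "Tesla", "null"]""")
  .load("/tmp/cars.csv")
{code}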



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25931:


Assignee: Apache Spark

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673980#comment-16673980
 ] 

Apache Spark commented on SPARK-25931:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22920

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25931:


Assignee: (was: Apache Spark)

> Benchmarking creation of Jackson parser
> ---
>
> Key: SPARK-25931
> URL: https://issues.apache.org/jira/browse/SPARK-25931
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
> invoke the Jackson parser at all due to an optimization for empty schemas 
> introduced in SPARK-24959. We need to add a new benchmark that forcibly 
> creates the Jackson parser for short and wide columns. For example:
> {code:scala}
>  spark.read
>   .schema(schema)
>   .json(path)
>   .filter((_: Row) => true)
>   .count()
> {code}
> The *.filter((_: Row) => true)* prevents projection pushdown to the JSON 
> datasource and forces full parsing of the JSON content.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25931) Benchmarking creation of Jackson parser

2018-11-03 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25931:
--

 Summary: Benchmarking creation of Jackson parser
 Key: SPARK-25931
 URL: https://issues.apache.org/jira/browse/SPARK-25931
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Existing JSON benchmarks perlineParsing and perlineParsingOfWideColumn don't 
invoke the Jackson parser at all due to an optimization for empty schemas introduced 
in SPARK-24959. We need to add a new benchmark that forcibly creates the Jackson 
parser for short and wide columns. For example:
{code:scala}
 spark.read
  .schema(schema)
  .json(path)
  .filter((_: Row) => true)
  .count()
{code}
The *.filter((_: Row) => true)* prevents projection pushdown to the JSON datasource 
and forces full parsing of the JSON content.
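A self-contained sketch of the idea (the schema, path, and session setup are placeholders, not 
the benchmark that was actually added):

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("jackson-bench").getOrCreate()

// Placeholder schema and path; the real benchmark generates its own datasets.
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val path = "/tmp/json-bench/short-and-wide"

// The identity filter over Row defeats the empty-schema optimization from
// SPARK-24959, so Jackson has to parse every record before counting.
val numRows = spark.read
  .schema(schema)
  .json(path)
  .filter((_: Row) => true)
  .count()
{code}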



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned

2018-11-03 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673957#comment-16673957
 ] 

Eyal Farago commented on SPARK-25548:
-

[~eaton], I think there are two possible approaches to handle this:

The first would be extracting the partition predicate and _And_-ing it with the 
original predicate:
{code:java}
select * from src_par where 
(P_d in (2,3)) and 
((p_d=2 and key=2) or (p_d=3 and key=3))
{code}
The second approach would be transforming this into a union:
{code:java}
select * from src_par where (p_d=2 and key=2) 
UNION ALL
select * from src_par where (p_d=3 and key=3)
{code}
I think the second approach is easier to implement, but it'd require additional 
rules to make sure partitions are not scanned multiple times, i.e. consider 
what'd happen if your predicate looked like this:
{code:java}
 (p_d=2 and key=2) or (p_d=3 and key=3) or (p_d=2 and key=33)
{code}
a naive approach would scan partition #2 twice, while it's pretty obvious this 
can be avoided by _OR_-ing the first and third conditions.

The first approach seems a bit more complicated, but I think it somewhat 
resembles what you've started implementing in your PR. [~cloud_fan], your 
thoughts?
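A toy model of the first approach in plain Scala (simplified case classes rather than the 
Catalyst expressions the optimizer actually works with, so all names here are illustrative 
only): for a predicate in OR-of-AND form, collect the partition-column conjuncts of each 
disjunct; if every disjunct constrains a partition column, OR those parts together and AND the 
result back onto the original filter.

{code:scala}
// Toy predicate AST; the real optimizer works on Catalyst Expression trees.
sealed trait Pred
case class Eq(col: String, value: Int) extends Pred
case class And(ps: Seq[Pred]) extends Pred
case class Or(ps: Seq[Pred]) extends Pred

/** Extract a partition-only predicate from OR(AND(...), ...), provided every
 *  disjunct constrains at least one partition column; otherwise nothing can be pruned. */
def partitionOnly(pred: Pred, partCols: Set[String]): Option[Pred] = pred match {
  case Or(disjuncts) =>
    val partParts = disjuncts.map {
      case And(cs)                     => cs.collect { case e @ Eq(c, _) if partCols(c) => e }
      case e @ Eq(c, _) if partCols(c) => Seq(e)
      case _                           => Seq.empty[Pred]
    }
    if (partParts.forall(_.nonEmpty)) Some(Or(partParts.map(And(_)))) else None
  case _ => None
}

// (p_d=2 and key=2) or (p_d=3 and key=3)  ==>  partition part: (p_d=2) or (p_d=3)
val original = Or(Seq(
  And(Seq(Eq("p_d", 2), Eq("key", 2))),
  And(Seq(Eq("p_d", 3), Eq("key", 3)))))
val prunable = partitionOnly(original, Set("p_d"))
// The rewritten filter would then be And(Seq(prunable.get, original)).
{code}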

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) to make the partition can 
> be pruned
> -
>
> Key: SPARK-25548
> URL: https://issues.apache.org/jira/browse/SPARK-25548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use a partition filter and a non-partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The SQL below will scan all the partition files, even though the partition 
> **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-11-03 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673949#comment-16673949
 ] 

Eyal Farago commented on SPARK-24437:
-

[~dvogelbacher],

I haven't looked too deep into this, but here are two immediate conclusions from the 
screenshot you've attached:
 # the broadcast is referenced from MapPartitionsRDD's f member; this seems 
reasonable for a broadcast join.
 # the entire thing is cached (CachedRDDBuilder); is it possible you're caching 
this Dataset? Unlike RDDs, a Dataset's persistence is manually managed, hence 
it is not automatically garbage collected once the last reference is dropped.

Having said that, I'd still expect Spark to cache only the in-memory 
representation and not the entire RDD lineage, so this does look like some sort 
of a bug, something like an over-capturing function/closure in the caching code.
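A small, generic illustration of that point (plain Dataset API, not the STS code from this 
report): the cached data only goes away when unpersist() is called, regardless of whether the 
reference is still reachable.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()
import spark.implicits._

val large = (1 to 100000).toDF("key")
val small = (1 to 10).toDF("key")

// Dataset caching is explicit: dropping the reference does not evict the cached
// plan, so a long-running service has to unpersist it itself.
val joined = large.join(broadcast(small), "key").persist()
try {
  joined.count()       // run the broadcast join with the cache in place
} finally {
  joined.unpersist()   // release the cached blocks so the cleaner can reclaim them
}
{code}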

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long-running instance of STS.
> With each query execution requiring a Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference to 
> UnsafeHashedRelation is being held by some other collection and does not become 
> eligible for GC, and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673948#comment-16673948
 ] 

Apache Spark commented on SPARK-25102:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22932

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}
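As a rough sketch of how the new key could be checked programmatically (using parquet-mr's 
footer API directly; the file path below is a placeholder, and the exact calls may differ 
across parquet versions):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path to one of the written part files.
val file = new Path("/tmp/p/part-00000.snappy.parquet")
val reader = ParquetFileReader.open(HadoopInputFile.from(file, new Configuration()))
try {
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  // Expected to contain the new key once this change is in place.
  println(kv.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
{code}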



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Description: 
Currently, Spark writes Spark version number into Hive Table properties with 
`spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata with 
`org.apache.spark.sql.create.version`. It's different from Hive Table property 
key `spark.sql.create.version`. It seems that we cannot change that for 
backward compatibility (even in Apache Spark 3.0)

*ORC*
{code}
User Metadata:
  org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
{code}

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}

  was:
Currently, Spark writes Spark version number into Hive Table properties with 
`spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata with 
`org.apache.spark.sql.create.version`. It's different from Hive Table property 
key `spark.sql.create.version`. It seems that we cannot change that for 
backward compatibility (even in Apache Spark 3.0)

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}


> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Description: 
Currently, Spark writes Spark version number into Hive Table properties with 
`spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata with 
`org.apache.spark.sql.create.version`. It's different from Hive Table property 
key `spark.sql.create.version`. It seems that we cannot change that for 
backward compatibility (even in Apache Spark 3.0)

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}

  was:
Currently, Spark writes Spark version number into Hive Table properties with 
`spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata 
consistently with `org.apache.spark.sql.create.version`.

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}


> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Description: 
Currently, Spark writes Spark version number into Hive Table properties with 
`spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata 
consistently with `org.apache.spark.sql.create.version`.

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}

  was:
Currently, Spark writes Spark version number into Hive Table properties with 
`org.apache.spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata 
consistently.

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}


> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata 
> consistently with `org.apache.spark.sql.create.version`.
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Description: 
Currently, Spark writes Spark version number into Hive Table properties with 
`org.apache.spark.sql.create.version`.
{code}
parameters:{
  spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
{code}

This issue aims to write Spark versions to ORC/Parquet file metadata 
consistently.

*PARQUET*
{code}
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)
extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
{code}

  was:
Currently, Spark writes Spark version number into Hive Table properties with 
`org.apache.spark.sql.create.version`.

This issue aims to write Spark versions to ORC/Parquet file metadata 
consistently.


> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `org.apache.spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata 
> consistently.
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to Parquet/ORC file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Description: 
Currently, Spark writes Spark version number into Hive Table properties with 
`org.apache.spark.sql.create.version`.

This issue aims to write Spark versions to ORC/Parquet file metadata 
consistently.

  was:
-PARQUET-352- added support for the "writer.model.name" property in the Parquet 
metadata to identify the object model (application) that wrote the file.

The easiest way to write this property is by overriding getName() of 
org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
getName() to the 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.


> Write Spark version to Parquet/ORC file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `org.apache.spark.sql.create.version`.
> This issue aims to write Spark versions to ORC/Parquet file metadata 
> consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Summary: Write Spark version to ORC/Parquet file metadata  (was: Write 
Spark version to Parquet/ORC file metadata)

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `org.apache.spark.sql.create.version`.
> This issue aims to write Spark versions to ORC/Parquet file metadata 
> consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to Parquet/ORC file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Summary: Write Spark version to Parquet/ORC file metadata  (was: Write 
Spark version information to Parquet file footers)

> Write Spark version to Parquet/ORC file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Minor
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to Parquet/ORC file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Priority: Major  (was: Minor)

> Write Spark version to Parquet/ORC file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25102) Write Spark version to Parquet/ORC file metadata

2018-11-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25102:
--
Affects Version/s: (was: 2.3.1)
   3.0.0

> Write Spark version to Parquet/ORC file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-11-03 Thread Xingbo Jiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-25901.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22912
[https://github.com/apache/spark/pull/22912]

> Barrier mode spawns a bunch of threads that get collected on gc
> ---
>
> Key: SPARK-25901
> URL: https://issues.apache.org/jira/browse/SPARK-25901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: yogesh garg
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
> 2018-10-31 at 11.57.42 AM.png
>
>
> After a barrier job is terminated (successfully or interrupted), the 
> accompanying thread created with `Timer` in `BarrierTaskContext` remains in a 
> waiting state until GC runs. We should probably have just one thread schedule 
> all such tasks, since they only log every 60 seconds.
> Here's a screenshot of the thread count growing with more tasks:
>  !Screen Shot 2018-10-31 at 11.57.25 AM.png! 
> Here's a screenshot of the thread count staying constant with more tasks:
>  !Screen Shot 2018-10-31 at 11.57.42 AM.png! 
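
The quoted suggestion (one shared thread instead of one Timer per task) could 
look roughly like the sketch below; this is only an illustration of the idea, 
not the code from the linked pull request:

{code}
import java.util.concurrent.{Executors, ScheduledFuture, ThreadFactory, TimeUnit}

// A single shared daemon scheduler for all barrier tasks, instead of
// creating a new java.util.Timer (and thus a new thread) per task.
object BarrierProgressLogger {
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "barrier-progress-logger")
      t.setDaemon(true)
      t
    }
  })

  // Log every 60 seconds for the given task until the returned handle is cancelled.
  def start(taskId: Long): ScheduledFuture[_] =
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = println(s"Task $taskId has been waiting on barrier() for a while")
    }, 60, 60, TimeUnit.SECONDS)
}

// Hypothetical usage inside a barrier task:
val handle = BarrierProgressLogger.start(taskId = 42L)
// ... barrier() returns or the task is interrupted ...
handle.cancel(false)
{code}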



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24758) Create table wants to use /user/hive/warehouse in clean clone

2018-11-03 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673939#comment-16673939
 ] 

Yuming Wang commented on SPARK-24758:
-

cc [~Qin Yao]

> Create table wants to use /user/hive/warehouse in clean clone
> -
>
> Key: SPARK-24758
> URL: https://issues.apache.org/jira/browse/SPARK-24758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> Got a clean clone of repository:
>  - git clone [https://github.com/apache/spark.git] spark_clean
>  - cd spark_clean/
>  - ./build/sbt -Phive -Phive-thriftserver clean package
> Ran spark-sql and tried to create a table:
>  - ./bin/spark-sql
>  - create table testit as select 1 a, 2 b;
> Got error:
> {noformat}
> 18/07/07 13:33:20 WARN HiveMetaStore: Location: 
> file:/user/hive/warehouse/testit specified for non-external table:testit
> 18/07/07 13:33:20 INFO FileUtils: Creating directory if it doesn't exist: 
> file:/user/hive/warehouse/testit
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:file:/user/hive/warehouse/testit is not a directory or 
> unable to create one);
> {noformat}
> To get things working, it seems you first need to do something with 
> DataFrameWriter (after removing metastore_db, if it exists):
> {noformat}
> scala> spark.range(0,1).write.saveAsTable("fred")
> spark.range(0,1).write.saveAsTable("fred")
> 18/07/07 14:08:08 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 18/07/07 14:08:08 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> scala> 
> {noformat}
> After that, create table statements work:
> {noformat}
> spark-sql> create table testit as select 1 a, 2 b;
> create table testit as select 1 a, 2 b;
> 18/07/07 14:14:40 WARN HiveMetaStore: Location: 
> file:/spark-warehouse/testit specified for non-external 
> table:testit
> 18/07/07 14:14:41 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Time taken: 3.387 seconds
> spark-sql> show tables;
> show tables;
> default fred    false
> default testit  false
> Time taken: 0.07 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
>  
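
Not from the report, but for anyone reproducing this: the warehouse location 
itself can be pointed somewhere writable via the standard 
`spark.sql.warehouse.dir` setting. A small Scala sketch (the path is just an 
example):

{code}
import org.apache.spark.sql.SparkSession

// Point the SQL warehouse at a writable local directory instead of the
// default /user/hive/warehouse resolved in a fresh clone.
val spark = SparkSession.builder()
  .appName("warehouse-dir-check")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("create table testit as select 1 a, 2 b")
{code}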



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org