[jira] [Created] (SPARK-30832) SQL function doc headers should link to anchors

2020-02-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30832:


 Summary: SQL function doc headers should link to anchors
 Key: SPARK-30832
 URL: https://issues.apache.org/jira/browse/SPARK-30832
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-30731) Refine doc-building workflow

2020-02-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30731:


 Summary: Refine doc-building workflow
 Key: SPARK-30731
 URL: https://issues.apache.org/jira/browse/SPARK-30731
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


There are a few rough edges in the workflow for building docs that could be 
refined:
 * installing dependencies with sudo pip
 * no pinned versions for any of the doc dependencies
 * use of some deprecated options
 * a race condition with jekyll serve






[jira] [Updated] (SPARK-30510) Publicly document options under spark.sql.*

2020-01-31 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30510:
-
Description: 
SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
but it doesn't appear to be documented in [the expected 
place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none of 
the options under {{spark.sql.*}} that are intended for users are documented on 
spark.apache.org/docs.

We should add a new documentation page for these options.

  was:SPARK-20236 added a new option, 
{{spark.sql.sources.partitionOverwriteMode}}, but it doesn't appear to be 
documented in [the expected 
place|http://spark.apache.org/docs/2.4.4/configuration.html].


> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none 
> of the options under {{spark.sql.*}} that are intended for users are 
> documented on spark.apache.org/docs.
> We should add a new documentation page for these options.
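
For reference, a minimal PySpark sketch of how this option is typically used (an active SparkSession {{spark}}, a DataFrame {{df}}, the "date" partition column, and the output path are assumptions for illustration, not taken from the ticket):

{code:python}
# Overwrite only the partitions present in the incoming DataFrame ("dynamic")
# instead of wiping the whole output directory (the default, "static").
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")
   .partitionBy("date")        # assumed partition column
   .parquet("/tmp/events"))    # placeholder path
{code}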






[jira] [Updated] (SPARK-30510) Publicly document options under spark.sql.*

2020-01-31 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30510:
-
Summary: Publicly document options under spark.sql.*  (was: Document 
spark.sql.sources.partitionOverwriteMode)

> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html].






[jira] [Updated] (SPARK-30665) Eliminate pypandoc dependency

2020-01-29 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30665:
-
Summary: Eliminate pypandoc dependency  (was: Remove Pandoc dependency in 
PySpark setup.py)

> Eliminate pypandoc dependency
> -
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Created] (SPARK-30672) numpy is a dependency for building PySpark API docs

2020-01-29 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30672:


 Summary: numpy is a dependency for building PySpark API docs
 Key: SPARK-30672
 URL: https://issues.apache.org/jira/browse/SPARK-30672
 Project: Spark
  Issue Type: Bug
  Components: Build, PySpark
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


As described here: 
https://github.com/apache/spark/pull/27376#discussion_r372550656






[jira] [Commented] (SPARK-30665) Remove Pandoc dependency in PySpark setup.py

2020-01-29 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026399#comment-17026399
 ] 

Nicholas Chammas commented on SPARK-30665:
--



> Remove Pandoc dependency in PySpark setup.py
> 
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Created] (SPARK-30665) Remove Pandoc dependency in PySpark setup.py

2020-01-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30665:


 Summary: Remove Pandoc dependency in PySpark setup.py
 Key: SPARK-30665
 URL: https://issues.apache.org/jira/browse/SPARK-30665
 Project: Spark
  Issue Type: Improvement
  Components: Build, PySpark
Affects Versions: 2.4.4, 2.4.3
Reporter: Nicholas Chammas


PyPI now supports Markdown project descriptions, so we no longer need to 
convert the Spark README into ReStructuredText and thus no longer need pypandoc.

Removing pypandoc has the added benefit of eliminating the failure mode 
described in [this PR|https://github.com/apache/spark/pull/18981].
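
For illustration, a minimal setuptools sketch of the pattern that makes pypandoc unnecessary (this is not Spark's actual setup.py; the name and version are placeholders, and publishing a Markdown description requires setuptools >= 38.6.0 and twine >= 1.11.0):

{code:python}
from setuptools import setup

# Ship the README as Markdown instead of converting it to reStructuredText.
with open("README.md", encoding="utf-8") as f:
    long_description = f.read()

setup(
    name="example-package",   # placeholder
    version="0.0.1",          # placeholder
    long_description=long_description,
    long_description_content_type="text/markdown",
)
{code}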






[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022576#comment-17022576
 ] 

Nicholas Chammas commented on SPARK-19248:
--

Thanks for getting to the bottom of the issue, [~jeff.w.evans], and for 
providing a workaround.

Would an appropriate solution be to make {{escapedStringLiterals}} default to 
{{True}}? Or does that cause other problems?
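
For context, a hedged sketch of the workaround in PySpark (the full config name is assumed to be {{spark.sql.parser.escapedStringLiterals}}; an active SparkSession {{spark}} is assumed):

{code:python}
# Restore the Spark 1.6-style string-literal parsing for regex functions.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")

df = spark.createDataFrame([('..   5.',)], ['col'])
# With the flag enabled, the backslash in the SQL string literal is kept as-is,
# so this should again return [Row(col=u'5')] as it did in 1.6.
df.selectExpr("regexp_replace(col, '( |\\.)*', '') AS col").collect()
{code}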

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2's regex execution. Using PySpark in 1.6.2, 
> we get the following expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the Spark 
> version. We checked both regexes in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 appears to be erroneous. I am not 
> currently able to confirm whether 2.1 is affected as well.






[jira] [Resolved] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-23 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-30557.
--
Resolution: Won't Fix

> Add public documentation for SPARK_SUBMIT_OPTS
> --
>
> Key: SPARK-30557
> URL: https://issues.apache.org/jira/browse/SPARK-30557
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
> documentation. I cannot see it documented 
> [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
>  in the docs.
> How do you use it? What is it useful for? What's an example usage? etc.






[jira] [Commented] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-17 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018254#comment-17018254
 ] 

Nicholas Chammas commented on SPARK-30557:
--

[~vanzin] - Do you know if this is something we should document? Would the 
documentation go in [Submitting 
Applications|https://spark.apache.org/docs/latest/submitting-applications.html]?
 (Another possibility is to put it in spark-env.sh.template.)

> Add public documentation for SPARK_SUBMIT_OPTS
> --
>
> Key: SPARK-30557
> URL: https://issues.apache.org/jira/browse/SPARK-30557
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
> documentation. I cannot see it documented 
> [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
>  in the docs.
> How do you use it? What is it useful for? What's an example usage? etc.






[jira] [Created] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-17 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30557:


 Summary: Add public documentation for SPARK_SUBMIT_OPTS
 Key: SPARK-30557
 URL: https://issues.apache.org/jira/browse/SPARK-30557
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Documentation
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
documentation. I cannot see it documented 
[anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
 in the docs.

How do you use it? What is it useful for? What's an example usage? etc.






[jira] [Commented] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode

2020-01-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015473#comment-17015473
 ] 

Nicholas Chammas commented on SPARK-30510:
--

[~hyukjin.kwon] I think I'm missing something here because it seems that none 
of the {{spark.sql.*}} options in 
[SQLConf.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala]
 are on the configurations page. Are they published somewhere else?
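
As an aside, the SQL configs defined in SQLConf.scala can at least be listed at runtime even though they are not on the published configuration page (a minimal sketch; assumes an active SparkSession {{spark}}):

{code:python}
# Lists the runtime SQL configurations along with their current values and
# short descriptions.
spark.sql("SET -v").show(truncate=False)
{code}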

> Document spark.sql.sources.partitionOverwriteMode
> -
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html].






[jira] [Created] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode

2020-01-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30510:


 Summary: Document spark.sql.sources.partitionOverwriteMode
 Key: SPARK-30510
 URL: https://issues.apache.org/jira/browse/SPARK-30510
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
but it doesn't appear to be documented in [the expected 
place|http://spark.apache.org/docs/2.4.4/configuration.html].






[jira] [Created] (SPARK-30173) Automatically close stale PRs

2019-12-08 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30173:


 Summary: Automatically close stale PRs
 Key: SPARK-30173
 URL: https://issues.apache.org/jira/browse/SPARK-30173
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


To manage the number of open PRs we have at any one time, we should 
automatically close stale PRs with a friendly message.

Background discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Closing-stale-PRs-with-a-GitHub-Action-td28477.html






[jira] [Updated] (SPARK-30128) Promote remaining "hidden" PySpark DataFrameReader options to load APIs

2019-12-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30128:
-
Description: 
Following on to SPARK-29903 and similar issues (linked), there are options 
available to the DataFrameReader for certain source formats that are not 
properly exposed in the relevant APIs.

These options include `timeZone` and `pathGlobFilter`. Instead of being noted 
under [the option() 
method|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.option],
 they should be implemented directly into load APIs that support them.

  was:
Following on to SPARK-27990 and similar issues (linked), there are options 
available to the DataFrameReader for certain source formats, but which are not 
exposed properly in the relevant APIs.

These options include `timeZone` and `pathGlobFilter`.


> Promote remaining "hidden" PySpark DataFrameReader options to load APIs
> ---
>
> Key: SPARK-30128
> URL: https://issues.apache.org/jira/browse/SPARK-30128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on to SPARK-29903 and similar issues (linked), there are options 
> available to the DataFrameReader for certain source formats that are not 
> properly exposed in the relevant APIs.
> These options include `timeZone` and `pathGlobFilter`. Instead of being noted 
> under [the option() 
> method|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.option],
>  they should be implemented directly into load APIs that support them.
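
To make the gap concrete, a minimal PySpark sketch (assumes Spark 3.0, where {{pathGlobFilter}} is available, an active SparkSession {{spark}}, and a placeholder path):

{code:python}
# Today these options are only reachable through option():
df = (spark.read
      .option("pathGlobFilter", "*.parquet")
      .parquet("/data/landing"))

# What this ticket proposes (illustrative, not yet implemented): surface the
# same options as keyword arguments on the load APIs that support them, e.g.
# df = spark.read.parquet("/data/landing", pathGlobFilter="*.parquet")
{code}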






[jira] [Created] (SPARK-30128) Promote remaining "hidden" PySpark DataFrameReader options to load APIs

2019-12-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30128:


 Summary: Promote remaining "hidden" PySpark DataFrameReader 
options to load APIs
 Key: SPARK-30128
 URL: https://issues.apache.org/jira/browse/SPARK-30128
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Nicholas Chammas


Following on to SPARK-27990 and similar issues (linked), there are options 
available to the DataFrameReader for certain source formats that are not 
properly exposed in the relevant APIs.

These options include `timeZone` and `pathGlobFilter`.






[jira] [Commented] (SPARK-27547) fix DataFrame self-join problems

2019-12-03 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987444#comment-16987444
 ] 

Nicholas Chammas commented on SPARK-27547:
--

Should this be marked as resolved by 
[#25107|https://github.com/apache/spark/pull/25107]?

 

> fix DataFrame self-join problems
> 
>
> Key: SPARK-27547
> URL: https://issues.apache.org/jira/browse/SPARK-27547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
>







[jira] [Updated] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs

2019-12-03 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30091:
-
Summary: Document mergeSchema option directly in the Python Parquet APIs  
(was: Document mergeSchema option directly in the Python API)

> Document mergeSchema option directly in the Python Parquet APIs
> ---
>
> Key: SPARK-30091
> URL: https://issues.apache.org/jira/browse/SPARK-30091
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet]
> Strangely, the `mergeSchema` option is mentioned in the docstring but not 
> implemented in the method signature.






[jira] [Created] (SPARK-30113) Document mergeSchema option in Python Orc APIs

2019-12-03 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30113:


 Summary: Document mergeSchema option in Python Orc APIs
 Key: SPARK-30113
 URL: https://issues.apache.org/jira/browse/SPARK-30113
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Nicholas Chammas









[jira] [Updated] (SPARK-30091) Document mergeSchema option directly in the Python API

2019-12-01 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30091:
-
Affects Version/s: 2.4.4  (was: 3.0.0)

> Document mergeSchema option directly in the Python API
> --
>
> Key: SPARK-30091
> URL: https://issues.apache.org/jira/browse/SPARK-30091
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet]
> Strangely, the `mergeSchema` option is mentioned in the docstring but not 
> implemented in the method signature.






[jira] [Created] (SPARK-30091) Document mergeSchema option directly in the Python API

2019-12-01 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30091:


 Summary: Document mergeSchema option directly in the Python API
 Key: SPARK-30091
 URL: https://issues.apache.org/jira/browse/SPARK-30091
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


[http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet]

Strangely, the `mergeSchema` option is mentioned in the docstring but not 
implemented in the method signature.
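
A minimal sketch of the current situation (placeholder path; an active SparkSession {{spark}} is assumed):

{code:python}
# The option is only reachable via option(), even though the docstring mentions it.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("/data/events"))

# What the docstring implies but the method signature does not actually accept:
# df = spark.read.parquet("/data/events", mergeSchema=True)
{code}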






[jira] [Created] (SPARK-30084) Add docs showing how to automatically rebuild Python API docs

2019-11-29 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30084:


 Summary: Add docs showing how to automatically rebuild Python API 
docs
 Key: SPARK-30084
 URL: https://issues.apache.org/jira/browse/SPARK-30084
 Project: Spark
  Issue Type: Improvement
  Components: Build, Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


`jekyll serve --watch` doesn't watch the API docs. That means you have to kill 
and restart jekyll every time you update your API docs, just to see the effect.






[jira] [Commented] (SPARK-29903) Add documentation for recursiveFileLookup

2019-11-17 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976252#comment-16976252
 ] 

Nicholas Chammas commented on SPARK-29903:
--

Happy to do that. Going to wait for [this 
PR|https://github.com/apache/spark/pull/26525] to be completed before writing 
any docs though, so I can address both the DataFrame and SQL APIs in one go.

> Add documentation for recursiveFileLookup
> -
>
> Key: SPARK-29903
> URL: https://issues.apache.org/jira/browse/SPARK-29903
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively 
> loading data from a source directory. There is currently no documentation for 
> this option.
> We should document this both for the DataFrame API as well as for SQL.






[jira] [Commented] (SPARK-29903) Add documentation for recursiveFileLookup

2019-11-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974577#comment-16974577
 ] 

Nicholas Chammas commented on SPARK-29903:
--

cc [~cloud_fan] and [~weichenxu123]

> Add documentation for recursiveFileLookup
> -
>
> Key: SPARK-29903
> URL: https://issues.apache.org/jira/browse/SPARK-29903
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively 
> loading data from a source directory. There is currently no documentation for 
> this option.
> We should document this both for the DataFrame API as well as for SQL.






[jira] [Created] (SPARK-29903) Add documentation for recursiveFileLookup

2019-11-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-29903:


 Summary: Add documentation for recursiveFileLookup
 Key: SPARK-29903
 URL: https://issues.apache.org/jira/browse/SPARK-29903
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively 
loading data from a source directory. There is currently no documentation for 
this option.

We should document this both for the DataFrame API as well as for SQL.
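
For reference, a minimal PySpark usage sketch of the option in question (placeholder path; an active SparkSession {{spark}} is assumed):

{code:python}
df = (spark.read
      .option("recursiveFileLookup", "true")  # descend into nested subdirectories
      .parquet("/data/raw"))
{code}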






[jira] [Comment Edited] (SPARK-27990) Provide a way to recursively load data from datasource

2019-11-07 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969462#comment-16969462
 ] 

Nicholas Chammas edited comment on SPARK-27990 at 11/7/19 5:54 PM:
---

Are there any docs for this new {{recursiveFileLookup}} option? I can't find 
anything.


was (Author: nchammas):
Are there any docs for this new option? I can't find anything.

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.






[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource

2019-11-07 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969462#comment-16969462
 ] 

Nicholas Chammas commented on SPARK-27990:
--

Are there any docs for this new option? I can't find anything.

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.






[jira] [Reopened] (SPARK-16483) Unifying struct fields and columns

2019-10-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-16483:
--

Though this has been bulk closed, I still think it's a valuable potential 
improvement to the DataFrame API that makes working with nested datasets much 
more seamless and natural. (SPARK-18084 and SPARK-18277 are two hiccups I hit 
myself that would be addressed by this ticket.)

I hope this will get another look after Spark 3.0 is out.

> Unifying struct fields and columns
> --
>
> Key: SPARK-16483
> URL: https://issues.apache.org/jira/browse/SPARK-16483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: bulk-closed, sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of 
> the usual JIRA/dev list channels.
> DataFrame provides a full set of manipulation operations for top-level 
> columns. They can be added, removed, modified and renamed. The same is not 
> yet true for fields inside structs, even though, from a logical standpoint, 
> Spark users may very well want to perform the same operations on struct fields, 
> especially since automatic schema discovery from JSON input tends to create 
> deeply nested structs.
> Common use-cases include:
>  - Remove and/or rename struct field(s) to adjust the schema
>  - Fix a data quality issue with a struct field (update/rewrite)
> To do this with the existing API by hand requires manually calling 
> {{named_struct}} and listing all fields, including ones we don't want to 
> manipulate. This leads to complex, fragile code that cannot survive schema 
> evolution.
> It would be far better if the various APIs that can now manipulate top-level 
> columns were extended to handle struct fields at arbitrary locations or, 
> alternatively, if we introduced new APIs for modifying any field in a 
> dataframe, whether it is a top-level one or one nested inside a struct.
> Purely for discussion purposes (overloaded methods are not shown):
> {code:java}
> class Column(val expr: Expression) extends Logging {
>   // ...
>   // matches Dataset.schema semantics
>   def schema: StructType
>   // matches Dataset.select() semantics
>   // '* support allows multiple new fields to be added easily, saving 
> cumbersome repeated withColumn() calls
>   def select(cols: Column*): Column
>   // matches Dataset.withColumn() semantics of add or replace
>   def withColumn(colName: String, col: Column): Column
>   // matches Dataset.drop() semantics
>   def drop(colName: String): Column
> }
> class Dataset[T] ... {
>   // ...
>   // Equivalent to sparkSession.createDataFrame(toDF.rdd, newSchema)
>   def cast(newSchema: StructType): DataFrame
> }
> {code}
> The benefit of the above API is that it unifies manipulating top-level & 
> nested columns. The addition of {{schema}} and {{select()}} to {{Column}} 
> allows for nested field reordering, casting, etc., which is important in data 
> exchange scenarios where field position matters. That's also the reason to 
> add {{cast}} to {{Dataset}}: it improves consistency and readability (with 
> method chaining). Another way to think of {{Dataset.cast}} is as the Spark 
> schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala 
> encodable type is to a {{StructType}} instance.
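
To make the pain point above concrete, a minimal PySpark sketch (column names are hypothetical; an active SparkSession {{spark}} is assumed):

{code:python}
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, ("a", 10))],
    "id INT, meta STRUCT<name: STRING, score: INT>",
)

# Rewriting just meta.score today means rebuilding the whole struct and
# re-listing every field we do NOT want to touch, which breaks under schema
# evolution.
fixed = df.withColumn(
    "meta",
    F.struct(
        F.col("meta.name").alias("name"),
        (F.col("meta.score") * 2).alias("score"),
    ),
)
{code}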






[jira] [Updated] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-29280:
-
Description: 
[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing,|https://stackoverflow.com/q/52011697/877069] 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
Hadoop filesystem API setting, since it [doesn't appear to be documented by 
Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
there is also a setting called {{spark.io.compression.codec}}, which seems to 
be for a different purpose.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example, I think all of the following should be possible:
{code:python}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
spark.read.json(..., compression='none')
{code}

  was:
[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing,|https://stackoverflow.com/q/52011697/877069] 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
Hadoop filesystem API setting, since it [doesn't appear to be documented by 
Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
there is also a setting called {{spark.io.compression.codec}}, which seems to 
be for a different purpose.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example:
{code:java}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}


> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
>  supports a {{compression}} option, but 
> [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
>  doesn't. The lack of a {{compression}} option in the reader causes some 
> 

[jira] [Comment Edited] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939856#comment-16939856
 ] 

Nicholas Chammas edited comment on SPARK-29102 at 9/28/19 5:35 AM:
---

I figured it out. Looks like the correct setting is {{io.compression.codecs}}, 
as instructed in the 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] repo, not 
{{spark.hadoop.io.compression.codecs}}. I mistakenly tried the latter after 
seeing others use it elsewhere.

So in summary, to read gzip files in Spark using this codec, you need to:
 # Start up Spark with "{{--packages nl.basjes.hadoop:splittablegzip:1.2}}".
 # Then, enable the new codec with "{{spark.conf.set('io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')}}".
 # From there, you can read gzipped CSVs as you would normally, via 
"{{spark.read.csv(...)}}".

I've confirmed that, using this codec, Spark loads a single gzipped file with 
multiple concurrent tasks (and without the codec it only runs one task). I 
haven't done any further testing to see what performance benefits there are in 
a realistic use case, but if this codec works as described in its README then 
that should be good enough for me!

I've filed SPARK-29280 about adding a {{compression}} option to 
{{DataFrameReader}} to match {{DataFrameWriter}} and make this kind of workflow 
a bit more straightforward.
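
A consolidated sketch of the three steps above in PySpark (package coordinates and codec class exactly as given in this comment; the file path is a placeholder):

{code:python}
# Step 1 (outside Python): start Spark with the codec on the classpath, e.g.
#   pyspark --packages nl.basjes.hadoop:splittablegzip:1.2

# Step 2: register the codec.
spark.conf.set(
    "io.compression.codecs",
    "nl.basjes.hadoop.io.compress.SplittableGzipCodec",
)

# Step 3: read the gzipped file as usual; the count should now run as multiple
# concurrent tasks instead of one.
df = spark.read.csv("/data/big-file.csv.gz")
df.count()
{code}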


was (Author: nchammas):
I figured it out. Looks like the correct setting is {{io.compression.codecs,}} 
as instructed in the 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] repo, not 
{{spark.hadoop.io.compression.codecs}}. I mistakenly tried that after seeing 
others use it elsewhere.

So in summary, to read gzip files in Spark using this codec, you need to:
 # Start up Spark with "{{--packages nl.basjes.hadoop:splittablegzip:1.2}}".
 # Then, enable the new codec with "{{spark.conf.set('io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')}}".
 # From there, you can read gzipped CSVs as you would normally, via 
"{{spark.read.csv(...)}}".

I've confirmed that, using this codec, Spark loads a single gzipped file with 
multiple concurrent tasks (and without the codec it only runs one task). I 
haven't done any further testing to see what performance benefits there are in 
a realistic use case, but if this codec works as described in its README then 
that should be good enough for me!

At this point I think all that remains is for me to file a Jira about adding a 
{{compression}} option to {{DataFrameReader}} to match {{DataFrameWriter}} and 
make this kind of workflow a bit more straightforward.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless 

[jira] [Commented] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939864#comment-16939864
 ] 

Nicholas Chammas commented on SPARK-29280:
--

cc [~hyukjin.kwon], [~cloud_fan]

> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
>  supports a {{compression}} option, but 
> [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
>  doesn't. The lack of a {{compression}} option in the reader causes some 
> friction in the following cases:
>  # You want to read some data compressed with a codec that Spark does not 
> [load by 
> default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
>  # You want to read some data with a codec that overrides one of the built-in 
> codecs that Spark supports.
>  # You want to explicitly instruct Spark on what codec to use on read when it 
> will not be able to correctly auto-detect it (e.g. because the file extension 
> is [missing,|https://stackoverflow.com/q/52011697/877069] 
> [non-standard|https://stackoverflow.com/q/44372995/877069], or 
> [incorrect|https://stackoverflow.com/q/49110384/877069]).
> Case #2 came up in SPARK-29102. There is a very handy library called 
> [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
> load a single gzipped file using multiple concurrent tasks. (You can see the 
> details of how it works and why it's useful in the project README and in 
> SPARK-29102.)
> To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
> Hadoop filesystem API setting, since it [doesn't appear to be documented by 
> Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
> there is also a setting called {{spark.io.compression.codec}}, which seems to 
> be for a different purpose.
> It would be much clearer for the user and more consistent with the writer 
> interface if the reader let you directly specify the codec.
> For example:
> {code:java}
> spark.read.option('compression', 'lz4').csv(...)
> spark.read.csv(..., 
> compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}






[jira] [Created] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-29280:


 Summary: DataFrameReader should support a compression option
 Key: SPARK-29280
 URL: https://issues.apache.org/jira/browse/SPARK-29280
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing,|https://stackoverflow.com/q/52011697/877069] 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
Hadoop filesystem API setting, since it [doesn't appear to be documented by 
Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
there is also a setting called {{spark.io.compression.codec}}, which seems to 
be for a different purpose.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example:
{code:java}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}






[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939856#comment-16939856
 ] 

Nicholas Chammas commented on SPARK-29102:
--

I figured it out. Looks like the correct setting is {{io.compression.codecs}}, 
as instructed in the 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] repo, not 
{{spark.hadoop.io.compression.codecs}}. I mistakenly tried the latter after 
seeing others use it elsewhere.

So in summary, to read gzip files in Spark using this codec, you need to:
 # Start up Spark with "{{--packages nl.basjes.hadoop:splittablegzip:1.2}}".
 # Then, enable the new codec with "{{spark.conf.set('io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')}}".
 # From there, you can read gzipped CSVs as you would normally, via 
"{{spark.read.csv(...)}}".

I've confirmed that, using this codec, Spark loads a single gzipped file with 
multiple concurrent tasks (and without the codec it only runs one task). I 
haven't done any further testing to see what performance benefits there are in 
a realistic use case, but if this codec works as described in its README then 
that should be good enough for me!

At this point I think all that remains is for me to file a Jira about adding a 
{{compression}} option to {{DataFrameReader}} to match {{DataFrameWriter}} and 
make this kind of workflow a bit more straightforward.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.






[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-23 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936077#comment-16936077
 ] 

Nicholas Chammas commented on SPARK-29102:
--

I wonder if 
[newAPIHadoopFile|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.newAPIHadoopFile],
 which lets you specify explicit classes to read the data with, would work 
here. I'll try that out later.

[~hyukjin.kwon] - Have you had any luck with 
{{spark.hadoop.io.compression.codecs}}?

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.






[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933803#comment-16933803
 ] 

Nicholas Chammas commented on SPARK-29102:
--

[~hyukjin.kwon] - Would you happen to know how to instruct Spark to use a 
custom codec _on read_?

I'm trying out that {{SplittableGzipCodec}} but I can't seem to get Spark to 
actually use it. I'm starting up PySpark as follows:
{code:java}
pyspark --packages nl.basjes.hadoop:splittablegzip:1.2{code}
Then I'm trying to read a gzipped CSV as follows:
{code:java}
spark.conf.set('spark.hadoop.io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
spark.read.csv(...).count() {code}
But Spark doesn't seem to be using the codec.

I know Spark can "see" the codec, because I can use it on write:
{code:java}
spark.range(10).write.csv('test.csv', mode='overwrite', 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}
However, Spark doesn't offer a {{compression}} option on read.

Do you know how I can get PySpark to use this codec on read? Apologies if this 
is not appropriate for JIRA. I can take it to StackOverflow if you prefer.
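
One more thing I plan to try, in case a Hadoop property set via 
{{spark.conf.set}} after startup is applied too late to affect the read path: 
supply the codec when the session is created. A sketch (same package as above; 
the path is a placeholder):
{code:python}
from pyspark.sql import SparkSession

# Sketch: pass the codec as a spark.hadoop.* property at session creation so
# it is folded into the Hadoop configuration before any files are read.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "nl.basjes.hadoop:splittablegzip:1.2")
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate()
)
spark.read.csv("/path/to/data.csv.gz").count()
{code}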

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932997#comment-16932997
 ] 

Nicholas Chammas commented on SPARK-29102:
--

{quote}It redundantly decompresses, and each map task processes only the part 
it wants. Each map task then stops decompressing once it has what it needs.
{quote}
Yup, that's what I was suggesting in this issue. Glad some folks have already 
tried that out. Hopefully, I'll get lucky and 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} will just work for me.
{quote}We could resolve this JIRA but if you feel like it's still feasible, I 
don't mind leaving this JIRA open.
{quote}
I've resolved it for now as "Won't Fix". I'll report back here if the solution 
you pointed me to works.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-29102.
--
Resolution: Won't Fix

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932953#comment-16932953
 ] 

Nicholas Chammas commented on SPARK-29102:
--

Ah, thanks for the reference! So if I'm just trying to read gzipped CSV or JSON 
text files, then  {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} may 
already provide a solution today, correct?

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930835#comment-16930835
 ] 

Nicholas Chammas commented on SPARK-29102:
--

cc [~cloud_fan] and [~hyukjin.kwon]: I noticed your work and comments on the 
PRs for SPARK-28366, so you may be interested in this issue.

Does this idea make sense? Does it seem feasible in theory at least?

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-16 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-29102:


 Summary: Read gzipped file into multiple partitions without full 
gzip expansion on a single-node
 Key: SPARK-29102
 URL: https://issues.apache.org/jira/browse/SPARK-29102
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


Large gzipped files are a common stumbling block for new users (SPARK-5685, 
SPARK-28366) and an ongoing pain point for users who must process such files 
delivered from external parties who can't or won't break them up into smaller 
files or compress them using a splittable compression format like bzip2.

To deal with large gzipped files today, users must either load them via a 
single task and then repartition the resulting RDD or DataFrame, or they must 
launch a preprocessing step outside of Spark to split up the file or recompress 
it using a splittable format. In either case, the user needs a single host 
capable of holding the entire decompressed file.

Spark can potentially a) spare new users the confusion over why only one task 
is processing their gzipped data, and b) relieve new and experienced users 
alike from needing to maintain infrastructure capable of decompressing a large 
gzipped file on a single node, by directly loading gzipped files into multiple 
partitions across the cluster.

The rough idea is to have tasks divide a given gzipped file into ranges and 
then have them all concurrently decompress the file, with each task throwing 
away the data leading up to the target range. (This kind of partial 
decompression is apparently [doable using standard Unix 
utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
doable in Spark too.)

In this way multiple tasks can concurrently load a single gzipped file into 
multiple partitions. Even though every task will need to unpack the file from 
the beginning to the task's target range, and the stage will run no faster than 
what it would take with Spark's current gzip loading behavior, this nonetheless 
addresses the two problems called out above. Users no longer need to load and 
then repartition gzipped files, and their infrastructure does not need to 
decompress any large gzipped file on a single node.
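
To make the per-task step concrete, here is a minimal single-process sketch in 
plain Python (not Spark code, and the byte offsets at the end are made up): 
each task streams the decompression from the start of the file and simply 
discards everything before its assigned range.
{code:python}
import gzip

def read_range_from_gzip(path, start, end, chunk_size=1 << 20):
    """Decompress from the beginning of the file, discard everything before
    `start`, and return the decompressed bytes in [start, end)."""
    out = bytearray()
    pos = 0  # current offset into the *decompressed* stream
    with gzip.open(path, "rb") as f:
        while pos < end:
            data = f.read(min(chunk_size, end - pos))
            if not data:
                break  # reached EOF before reaching `end`
            skip = max(start - pos, 0)
            if skip < len(data):
                out.extend(data[skip:])
            pos += len(data)
    return bytes(out)

# e.g. the second of four equal ranges of a file that is ~4 GB when decompressed
part = read_range_from_gzip("data.csv.gz", 1 * 2**30, 2 * 2**30)
{code}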



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25603) Generalize Nested Column Pruning

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910738#comment-16910738
 ] 

Nicholas Chammas commented on SPARK-25603:
--

[~dbtsai] - Just watched [your Spark Summit talk on this 
issue|https://www.youtube.com/watch?v=qHIA5YbZ8_4]. Thanks for all the work you 
and others have done here.

Will there be any spillover benefits from this effort for other issues 
affecting how Spark handles nested fields, like SPARK-16483 or SPARK-18084?

> Generalize Nested Column Pruning
> 
>
> Key: SPARK-25603
> URL: https://issues.apache.org/jira/browse/SPARK-25603
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910655#comment-16910655
 ] 

Nicholas Chammas edited comment on SPARK-4502 at 8/19/19 7:55 PM:
--

Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be 
enabled by default as part of SPARK-27644.

-With regards to aggregates breaking pruning, have you reported that somewhere? 
If not, I recommend reporting it and linking to the new issue from here.-

Looks like the problem with aggregates breaking schema pruning is already being 
tracked in SPARK-27217.


was (Author: nchammas):
Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be 
enabled by default as part of SPARK-27644.

With regards to aggregates breaking pruning, have you reported that somewhere? 
If not, I recommend reporting it and linking to the new issue from here.

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910691#comment-16910691
 ] 

Nicholas Chammas edited comment on SPARK-25150 at 8/19/19 7:39 PM:
---

I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and, 
particularly in the case where cross joins are enabled, it appears to be a 
correctness issue.

My original attachments still capture the problem. These are the inputs:
 * [^persons.csv]
 * [^states.csv]
 * [^zombie-analysis.py]

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]


was (Author: nchammas):
I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and, 
particularly in the case where cross joins are enabled, it appears to be a 
correctness issue.

My original attachments still capture the problem. These are the inputs:
 * !persons.csv|width=7,height=7,align=absmiddle!
 * !states.csv|width=7,height=7,align=absmiddle!
 * [^zombie-analysis.py] !zombie-analysis.py|width=7,height=7,align=absmiddle!

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be a left outer join instead of an inner join (since some of the aggregates 
> are not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-25150:
-
Affects Version/s: 2.4.3
   Labels: correctness  (was: )

I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and, 
particularly in the case where cross joins are enabled, it appears to be a 
correctness issue.

My original attachments still capture the problem. These are the inputs:
 * !persons.csv|width=7,height=7,align=absmiddle!
 * !states.csv|width=7,height=7,align=absmiddle!
 * [^zombie-analysis.py] !zombie-analysis.py|width=7,height=7,align=absmiddle!

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be a left outer join instead of an inner join (since some of the aggregates 
> are not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19248:
-
Labels: correctness  (was: )

Tagging this as a correctness issue since Spark 2+'s output differs both from 
Python's and from Spark 1.6's.

Python 3.7.4 + Spark 2.4.3:
{code:java}
>>> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
>>> df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
[Row(col='5')]
>>> df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
[Row(col='')]  <-- This differs from Python's output as well as Spark 1.6's 
output.
>>> import re
>>> re.sub(pattern='( |\.)*', repl='', string='..   5.')
'5'{code}

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18084:
-
Affects Version/s: 2.4.3

Retested and confirmed that this issue is still present in Spark 2.4.3.
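
In the meantime, one workaround sketch (the column and output path names below 
are just examples) is to copy the nested field up to a top-level column and 
partition on the copy:
{code:python}
from pyspark.sql import Row, functions as F

rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
df = spark.createDataFrame(rdd)

# Promote a.b to a top-level column and partition on that copy.
# (Writing Parquet here, since the text sink accepts only one string column.)
(df
 .withColumn('a_b', F.col('a.b'))
 .write
 .partitionBy('a_b')
 .parquet('/tmp/test-partitioned'))
{code}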

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",

[jira] [Updated] (SPARK-10892) Join with Data Frame returns wrong results

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-10892:
-
Affects Version/s: 2.4.0
   Labels: correctness  (was: )

Updating affected version and label per the comments.

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 2.4.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
>  Labels: correctness
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  

[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910655#comment-16910655
 ] 

Nicholas Chammas commented on SPARK-4502:
-

Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be 
enabled by default as part of SPARK-27644.

With regards to aggregates breaking pruning, have you reported that somewhere? 
If not, I recommend reporting it and linking to the new issue from here.
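
In the meantime, the pruning can be switched on per session. A sketch against 
the example query from this ticket (the config name is the flag the 
SPARK-27644 work flips, and the path is a placeholder):
{code:python}
# Opt in to nested schema pruning ahead of it becoming the default.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

# With pruning on, only User.contributors_enabled should be read from Parquet.
spark.read.parquet("/path/to/tweets.parquet") \
    .select("User.contributors_enabled") \
    .explain()
{code}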

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT

2019-05-22 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16824:
-
Labels:   (was: bulk-closed)

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-16824) Add API docs for VectorUDT

2019-05-22 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-16824:
--

Reviewing the links here, it seems that VectorUDT has been in use since at 
least 2016, but it still doesn't appear to be documented. Reopening this issue 
and updating the tag.

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: bulk-closed
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT

2019-05-22 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16824:
-
Affects Version/s: 2.4.3

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: bulk-closed
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-21 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19248:
-
Labels:   (was: bulk-closed)

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-21 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19248:
-
Affects Version/s: 2.4.3

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: bulk-closed
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-21 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-19248:
--

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: bulk-closed
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-21 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19248:
-
Component/s: PySpark

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: bulk-closed
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-21 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844547#comment-16844547
 ] 

Nicholas Chammas commented on SPARK-19248:
--

Looks like Spark 2.4.3 still exhibits the behavior reported in the original 
issue: 
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 13:25:00)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame([('..   5.',)], ['col'])
>>> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS 
>>> col"]).collect()
>>> dfout   
[Row(col='5')]
>>> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
>>> col"]).collect()
>>> dfout2
[Row(col='')]
>>> 
{code}
[~hyukjin.kwon] - I'm going to reopen this issue.

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: bulk-closed
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2019-05-21 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844530#comment-16844530
 ] 

Nicholas Chammas commented on SPARK-18277:
--

[~hyukjin.kwon] - If I still think this issue is relevant, should I just reopen 
it?

> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: bulk-closed
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs. If you try {{when()}}, you realize that you cannot do 
> {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
> appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
> '')}} it doesn't catch the null values.
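
In the meantime, a workaround sketch (editorial, not part of the original
report; it assumes the {{df}} built in the snippet above): rebuild the struct
with {{coalesce}}, since {{na.fill()}} skips struct fields and
{{col('a.b') == None}} never matches the nulls.
{code:python}
# Hedged workaround sketch: fill a null struct field by rebuilding the struct.
# Assumes `df` is the two-row DataFrame constructed in the example above.
from pyspark.sql import functions as F

filled = (
    df
    # Replace `a` with a new struct whose `b` field has nulls coalesced away.
    .withColumn('a', F.struct(F.coalesce(F.col('a.b'), F.lit('')).alias('b')))
    # na.fill() still handles the top-level string column `c`.
    .na.fill('')
)
filled.show()
{code}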



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2018-10-30 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668941#comment-16668941
 ] 

Nicholas Chammas commented on SPARK-10892:
--

Is this issue still present in Spark 2.3.2 or 2.4.0, and if so, shouldn't we 
mark it with the {{correctness}} label?

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|

[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632512#comment-16632512
 ] 

Nicholas Chammas commented on SPARK-25150:
--

Correct, this isn't a cross join. It's just a plain inner join.

In theory, whether cross joins are enabled or not should have no bearing on the 
result. However, what we're seeing is that without them enabled we get an 
incorrect error and with them enabled we get incorrect results.
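
(For reference, the flag being discussed is just a regular SQL conf and can be
flipped either on an existing session or at build time; nothing here is
specific to the attached reproduction:)
{code:python}
# Standard ways to toggle the flag discussed above.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

# or, equivalently, when building the session:
# SparkSession.builder.config("spark.sql.crossJoin.enabled", "true").getOrCreate()
{code}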

If we were actually trying a cross join (i.e. no {{on=(...)}} condition 
specified) I think those results (with the 4 output rows) would still be 
incorrect since you'd expect NH's population to be combined with RI's stats in 
one of the output rows, but that's not the case. You'd also expect MA to show 
up in the output, too.

> The second join joins on a column in {{states}}, but that is not a DataFrame 
> used in that join. Is that the problem?

Not sure what you mean here. Both joins join on {{states}}, which is the first 
DataFrame in the definition of {{analysis}}.

 

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632381#comment-16632381
 ] 

Nicholas Chammas commented on SPARK-25150:
--

([~petertoth] - Seeing your comment edit now.) OK, so it seems the two problems 
I identified are accurate, but they have a common root cause. Thanks for 
confirming.

[~srowen] - Given Peter's confirmation that the results with cross join enabled 
are incorrect, I believe we should mark this as a correctness issue.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632281#comment-16632281
 ] 

Nicholas Chammas commented on SPARK-25150:
--

I've uploaded the expected output.

I realize that the reproduction I've attached to this ticket 
(zombie-analysis.py plus the related files), though complete and 
self-contained, is a bit verbose. If it's not helpful enough I will see if I 
can boil it down further.

[~petertoth] - I suggest you take another look at the output with cross joins 
enabled and compare it to what (I think) is the correct expected output. If I'm 
understanding things correctly, there are two issues: 1) the bad error when 
cross join is not enabled (there should be no error), and 2) the incorrect 
results when cross join _is_ enabled (the results I just uploaded).

Your PR doesn't appear to investigate or address the incorrect results issue, 
so I'm not sure if it would fix that too, or if I am just mistaken about there 
being a second issue.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-25150:
-
Attachment: expected-output.txt

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-25150:
-
Description: 
I have two DataFrames, A and B. From B, I have derived two additional 
DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
confusing error:
{code:java}
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
{code}
Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark 
appears to give me incorrect answers.

I am not sure if I am missing something obvious, or if there is some kind of 
bug here. The "join condition is missing" error is confusing and doesn't make 
sense to me, and the seemingly incorrect output is concerning.

I've attached a reproduction, along with the output I'm seeing with and without 
the implicit cross join enabled.

I realize the join I've written is not "correct" in the sense that it should be 
left outer join instead of an inner join (since some of the aggregates are not 
available for all states), but that doesn't explain Spark's behavior.

  was:
I have two DataFrames, A and B. From B, I have derived two additional 
DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
confusing error:
{code:java}
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
{code}
Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark 
appears to give me incorrect answers.

I am not sure if I am missing something obvious, or if there is some kind of 
bug here. The "join condition is missing" error is confusing and doesn't make 
sense to me, and the seemingly incorrect output is concerning.

I've attached a reproduction, along with the output I'm seeing with and without 
the implicit cross join enabled.

I realize the join I've written is not correct in the sense that it should be 
left outer join instead of an inner join (since some of the aggregates are not 
available for all states), but that doesn't explain Spark's behavior.


> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632252#comment-16632252
 ] 

Nicholas Chammas commented on SPARK-25150:
--

The attachments on this ticket contain a complete reproduction. The comment 
towards the beginning of zombie-analysis.py points to the config that, when 
enabled, appears to yield incorrect results. (Without the config enabled we get 
a confusing/incorrect error, which is a second issue.)

The results with and without the config enabled are also attached here. I will 
add another attachment showing the expected results.

I believe some folks over on the linked PR provided a simpler reproduction of 
part of this issue, but I haven't taken a close look at it to see if it 
captures the same two issues (incorrect results + confusing/incorrect error).

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1663#comment-1663
 ] 

Nicholas Chammas commented on SPARK-25150:
--

[~cloud_fan] / [~srowen] - Would you consider this issue (particularly the one 
expressed when spark.sql.crossJoin.enabled is set to true) to be a correctness 
bug? I think it is, but I'd like a committer to confirm and add the appropriate 
label if necessary.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-21 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623788#comment-16623788
 ] 

Nicholas Chammas commented on SPARK-25150:
--

Given that Spark appears to provide incorrect results when 
spark.sql.crossJoin.enabled is set to true, shall we mark this as a correctness 
issue?

[~petertoth] / [~EeveeB] - Would you agree with that characterization?

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584239#comment-16584239
 ] 

Nicholas Chammas edited comment on SPARK-25150 at 8/17/18 6:15 PM:
---

I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this is 
covered by any of them, and didn't have time to set up 2.3.2 to see if this 
problem is still present there. I will be away for some time and thought it 
best to report this now in case someone can pick it up and investigate further 
until I get back.

cc [~marmbrus].


was (Author: nchammas):
I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this is 
covered by any of them, and didn't have time to setup 2.3.2 to see if this 
problem is still present there.

cc [~marmbrus].

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584239#comment-16584239
 ] 

Nicholas Chammas commented on SPARK-25150:
--

I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this is 
covered by any of them, and didn't have time to set up 2.3.2 to see if this 
problem is still present there.

cc [~marmbrus].

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-25150:
-
Attachment: zombie-analysis.py
states.csv
persons.csv
output-without-implicit-cross-join.txt
output-with-implicit-cross-join.txt

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-25150:


 Summary: Joining DataFrames derived from the same source yields 
confusing/incorrect results
 Key: SPARK-25150
 URL: https://issues.apache.org/jira/browse/SPARK-25150
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Nicholas Chammas
 Attachments: output-with-implicit-cross-join.txt, 
output-without-implicit-cross-join.txt, persons.csv, states.csv, 
zombie-analysis.py

I have two DataFrames, A and B. From B, I have derived two additional 
DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
confusing error:
{code:java}
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
{code}
Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark 
appears to give me incorrect answers.

I am not sure if I am missing something obvious, or if there is some kind of 
bug here. The "join condition is missing" error is confusing and doesn't make 
sense to me, and the seemingly incorrect output is concerning.

I've attached a reproduction, along with the output I'm seeing with and without 
the implicit cross join enabled.

I realize the join I've written is not correct in the sense that it should be 
left outer join instead of an inner join (since some of the aggregates are not 
available for all states), but that doesn't explain Spark's behavior.
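
For readers who don't want to open the attachments, here is a minimal sketch of
the shape described above (illustrative names and data only; this is not the
attached zombie-analysis.py, and it is not guaranteed to trip the same analyzer
path):
{code:python}
# Minimal sketch of the reported shape: A joined to two aggregates B1 and B2
# that are both derived from the same source B. Names and data are made up.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("spark-25150-sketch")
    .getOrCreate()
)

A = spark.createDataFrame([(1, "RI"), (2, "NH")], ["id", "state"])      # like states.csv
B = spark.createDataFrame([(1, 30), (1, 40), (2, 50)], ["id", "age"])   # like persons.csv

B1 = B.groupBy("id").agg(F.count("*").alias("population"))
B2 = B.groupBy("id").agg(F.avg("age").alias("mean_age"))

# A plain inner join of A to both derived frames on the shared key. The
# attached reproduction's join is of this general shape and, per the
# description above, hits the "Join condition is missing or trivial" error
# on 2.3.1; this trimmed-down version may or may not hit the same path.
analysis = A.join(B1, on="id").join(B2, on="id")
analysis.show()
{code}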



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468045#comment-16468045
 ] 

Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:22 PM:
---

{quote}So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa
{quote}
Since writing this, I realized that the DataFrame API is able to express `IN` 
and `NOT IN` via an inner join and left anti join respectively. And I suspect 
most other cases where I may have thought the DataFrame API is less powerful 
than SQL are incorrect. The various DataFrame join types basically cover a lot 
of the stuff you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing 
users, in the particular example I provided above, to express their query as 
follows:
{code:java}
(table1
    .join(
        table2,
        on='name',
        how='left_anti',
    )
){code}
This is equivalent to the SQL query I posted, and does not require that 
anything be collected locally, so it scales just as well.
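
A quick, self-contained way to sanity-check that equivalence (illustrative
data; the usual caveat applies that SQL NOT IN and a left anti join only agree
when the subquery column contains no nulls):
{code:python}
# Sanity check with made-up data: SQL NOT IN vs. a left anti join.
# Caveat: the two only agree when table2.name has no nulls, because NOT IN
# returns no rows at all if the subquery produces a null.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("spark-23945-check")
    .getOrCreate()
)

table1 = spark.createDataFrame([("alice",), ("bob",), ("carol",)], ["name"])
table2 = spark.createDataFrame([("bob",)], ["name"])
table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")

sql_result = spark.sql("""
    SELECT *
    FROM table1
    WHERE name NOT IN (SELECT name FROM table2)
""")
df_result = table1.join(table2, on="name", how="left_anti")

# Both should show alice and carol.
sql_result.show()
df_result.show()
{code}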

[~hvanhovell] - Does this make sense?


was (Author: nchammas):
> So in the grand scheme of things I'd expect DataFrames to be able to do 
> everything that SQL can and vice versa

 

Since writing this, I realized that the DataFrame API is able to express `IN` 
and `NOT IN` via an inner join and left anti join respectively. And I suspect 
most other cases where I may have thought the DataFrame API is less powerful 
than SQL are incorrect. The various DataFrame join types basically cover a lot 
of the stuff you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing 
users, in the particular example I provided above, to express their query as 
follows:
{code:java}
(table1
.join(
table2,
on='name',
how='left_anti',
)
){code}
This is equivalent to the SQL query I posted, and does not require that 
anything be collected locally, so it scales just as well.

[~hvanhovell] - Does this make sense?

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468045#comment-16468045
 ] 

Nicholas Chammas commented on SPARK-23945:
--

> So in the grand scheme of things I'd expect DataFrames to be able to do 
> everything that SQL can and vice versa

 

Since writing this, I realized that the DataFrame API is able to express `IN` 
and `NOT IN` via an inner join and left anti join respectively. And I suspect 
most other cases where I may have thought the DataFrame API is less powerful 
than SQL are incorrect. The various DataFrame join types basically cover a lot 
of the stuff you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing 
users, in the particular example I provided above, to express their query as 
follows:
{code:java}
(table1
    .join(
        table2,
        on='name',
        how='left_anti',
    )
){code}
This is equivalent to the SQL query I posted, and does not require that 
anything be collected locally, so it scales just as well.

[~hvanhovell] - Does this make sense?

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316
 ] 

Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:13 PM:
---

I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN}} and {{NOT IN}}.


was (Author: nchammas):
I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN }}and {{NOT IN.}}

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316
 ] 

Nicholas Chammas commented on SPARK-23945:
--

I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN }}and {{NOT IN.}}

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-23945:
-
Description: 
In SQL you can filter rows based on the result of a subquery:
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
    .where(
        ~col('name').isin(
            table2.select('name')
        )
    )
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 

  was:
In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 


> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23945:


 Summary: Column.isin() should accept a single-column DataFrame as 
input
 Key: SPARK-23945
 URL: https://issues.apache.org/jira/browse/SPARK-23945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Nicholas Chammas


In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
    .where(
        ~col('name').isin(
            table2.select('name')
        )
    )
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414190#comment-16414190
 ] 

Nicholas Chammas commented on SPARK-22513:
--

Thanks for the breakdown. This will be handy for reference. So I guess at the 
summary level Sean was correct. :D

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412423#comment-16412423
 ] 

Nicholas Chammas edited comment on SPARK-23716 at 3/24/18 5:13 AM:
---

For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies [my need|https://github.com/nchammas/flintrock/issues/238], 
which is focused on syncing Spark releases from the Apache archive to an S3 
bucket.

Closing this as "Won't Fix".


was (Author: nchammas):
For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies my need. I am syncing Spark releases from the Apache 
distribution archive to a personal S3 bucket and need a way to verify the 
integrity of the files.

> Change SHA512 style in release artifacts to play nicely with shasum utility
> ---
>
> Key: SPARK-23716
> URL: https://issues.apache.org/jira/browse/SPARK-23716
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As [discussed 
> here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-23716.
--
Resolution: Won't Fix

For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies my need. I am syncing Spark releases from the Apache 
distribution archive to a personal S3 bucket and need a way to verify the 
integrity of the files.
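
For reference, a rough sketch of that kind of conversion (not the actual code I 
used), assuming input in the {{gpg --print-md SHA512}} style, i.e. 
{{filename: HEX HEX ...}} possibly wrapped over several lines:

{code}
import re
import sys

def gpg_md_to_shasum(gpg_output: str) -> str:
    """Convert a `gpg --print-md SHA512` style hash to `shasum -a 512` style."""
    # Collapse the wrapped lines, then split the filename off the hex digest.
    flattened = " ".join(line.strip() for line in gpg_output.splitlines())
    filename, _, digest = flattened.partition(":")
    # shasum wants a lowercase hex digest, two spaces, then the filename.
    return re.sub(r"\s+", "", digest).lower() + "  " + filename.strip()

if __name__ == "__main__":
    print(gpg_md_to_shasum(sys.stdin.read()))
{code}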

> Change SHA512 style in release artifacts to play nicely with shasum utility
> ---
>
> Key: SPARK-23716
> URL: https://issues.apache.org/jira/browse/SPARK-23716
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As [discussed 
> here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412218#comment-16412218
 ] 

Nicholas Chammas commented on SPARK-22513:
--

Fair enough.

Just as an alternate confirmation, [~ste...@apache.org] can you comment on 
whether there might be any issues running Spark built against Hadoop 2.7 with, 
say, HDFS 2.8?

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> Hadoop 2.8 comes with a patch that is necessary to make it run on NixOS [1]. 
> Therefore, it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405615#comment-16405615
 ] 

Nicholas Chammas commented on SPARK-23534:
--

I don't know what it takes to add a Hadoop 3.0 build profile to Spark, and I 
also don't know anyone who can help with this.

However, I don't see the urgency in getting this done. Who, really, is already 
running Hadoop 3.0 in production, or even getting close to that? And what 
vendors are already shipping Hadoop 3.0? Cloudera still ships 2.6 and EMR is on 
2.7.

Hadoop releases have historically been very disruptive, with 
backwards-incompatible changes even in minor releases. If that's still the 
case, I doubt many people will be running Hadoop 3.0 anytime soon. But I 
confess my stance here is based on my experience a couple of years ago and may 
no longer be relevant.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped into Hadoop 3.0, or soon will. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405563#comment-16405563
 ] 

Nicholas Chammas commented on SPARK-23534:
--

I believe this ticket is a duplicate of SPARK-23151, though given the activity 
here, perhaps we should close that one in favor of this ticket.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped into Hadoop 3.0, or soon will. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405561#comment-16405561
 ] 

Nicholas Chammas commented on SPARK-22513:
--

[~srowen] - Just curious: How do you know that Spark built against Hadoop 2.7 
should work with Hadoop 2.8? Is there a reference for that somewhere?

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> Hadoop 2.8 comes with a patch that is necessary to make it run on NixOS [1]. 
> Therefore, it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-16 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23716:


 Summary: Change SHA512 style in release artifacts to play nicely 
with shasum utility
 Key: SPARK-23716
 URL: https://issues.apache.org/jira/browse/SPARK-23716
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.0
Reporter: Nicholas Chammas


As [discussed 
here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389723#comment-16389723
 ] 

Nicholas Chammas commented on SPARK-18492:
--

[~imranshaik] - This is an open source project. You cannot demand that anyone 
"solve this asap". People are contributing their free time or time donated by 
the companies that employ them.

If you want someone to fix this for you "asap", perhaps you should look at paid 
support from Databricks, Cloudera, Hortonworks, Amazon, or some other big 
company that works with Spark.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>Priority: Major
> Attachments: Screenshot from 2018-03-02 12-57-51.png
>
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for a method's code is exceeded is that 
> the code generator uses an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ 

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383843#comment-16383843
 ] 

Nicholas Chammas commented on SPARK-18492:
--

Are you seeing the same on Spark 2.3.0? Apparently, a bunch of problems related 
to the 64KB limit were resolved. They may not have an impact on this error with 
GeneratedIterator, but it's good to check just in case.

From [http://spark.apache.org/releases/spark-release-2-3-0.html]:

*Performance and stability*
 * [SPARK-22510][SPARK-22692][SPARK-21871] Further stabilize the codegen 
framework to avoid hitting the {{64KB}} JVM bytecode limit on the Java method 
and Java compiler constant pool limit

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>Priority: Major
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for a method's code is exceeded is that 
> the code generator uses an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> 

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217115#comment-16217115
 ] 

Nicholas Chammas commented on SPARK-13587:
--

To follow-up on my [earlier 
comment|https://issues.apache.org/jira/browse/SPARK-13587?focusedCommentId=15740419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15740419],
 I created a [completely self-contained sample 
repo|https://github.com/massmutual/sample-pyspark-application] demonstrating a 
technique for bundling PySpark app dependencies in an isolated way. It's the 
technique that Ben, I, and several others discussed here in this JIRA issue.

https://github.com/massmutual/sample-pyspark-application

The approach has advantages (like letting you ship a completely isolated Python 
environment, so you don't even need Python installed on the workers) and 
disadvantages (requires YARN; increases job startup time). Hope some of you 
find the sample repo useful until Spark adds more "first-class" support for 
Python dependency isolation.
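
The rough shape of the technique (a hedged sketch, not the repo's actual code; 
the archive name and paths below are placeholders) is to ship a packed Python 
environment as a YARN archive and point the workers at the interpreter inside 
it:

{code}
import os
from pyspark.sql import SparkSession

# Placeholder: environment.tar.gz would be a packed Python environment (for
# example, a conda env) containing the interpreter plus all app dependencies.
# YARN unpacks it into each container's working directory under the alias
# "environment", so the relative path below resolves on every worker.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.yarn.dist.archives", "environment.tar.gz#environment")
    .getOrCreate()
)

# Executors now launch their Python workers from the shipped environment,
# so the nodes themselves don't need Python (or any packages) preinstalled.
{code}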

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive ones)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy when switching to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168038#comment-16168038
 ] 

Nicholas Chammas commented on SPARK-17025:
--

I take that back. I won't be able to test this for the time being. If someone 
else wants to test this out and needs pointers, I'd be happy to help.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128271#comment-16128271
 ] 

Nicholas Chammas commented on SPARK-17025:
--

I'm still interested in this but I won't be able to test it until mid-next 
month, unfortunately. I've set myself a reminder to revisit this.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21712) Clarify PySpark Column.substr() type checking error message

2017-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21712:


 Summary: Clarify PySpark Column.substr() type checking error 
message
 Key: SPARK-21712
 URL: https://issues.apache.org/jira/browse/SPARK-21712
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Nicholas Chammas
Priority: Trivial


https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409

"Can not mix the type" is really unclear.

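For context, a minimal (hypothetical) reproduction of the message: the check 
fires whenever startPos and length are not the same type, e.g. a plain int 
mixed with a Column:

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Nicholas",)], ["name"])

try:
    # startPos is a plain int while length is a Column; substr() only allows
    # both-int or both-Column, so this trips the type check.
    df.select(col("name").substr(1, lit(3)))
except TypeError as e:
    # Prints "Can not mix the type", which says nothing about which argument
    # is wrong or what combinations are accepted.
    print(e)
{code}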


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters

2017-06-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059536#comment-16059536
 ] 

Nicholas Chammas commented on SPARK-21110:
--

cc [~marmbrus] - Assuming this is a valid feature request, maybe it belongs as 
part of a larger umbrella task.

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21110) Structs should be usable in inequality filters

2017-06-15 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-21110:
-
Summary: Structs should be usable in inequality filters  (was: Structs 
should be orderable)

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21110) Structs should be orderable

2017-06-15 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21110:


 Summary: Structs should be orderable
 Key: SPARK-21110
 URL: https://issues.apache.org/jira/browse/SPARK-21110
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Nicholas Chammas
Priority: Minor


It seems like a missing feature that you can't compare structs in a filter on a 
DataFrame.

Here's a simple demonstration of a) where this would be useful and b) how it's 
different from simply comparing each of the components of the structs.

{code}
import pyspark
from pyspark.sql.functions import col, struct, concat

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
[
('Boston', 'Bob'),
('Boston', 'Nick'),
('San Francisco', 'Bob'),
('San Francisco', 'Nick'),
],
['city', 'person']
)
pairs = (
df.select(
struct('city', 'person').alias('p1')
)
.crossJoin(
df.select(
struct('city', 'person').alias('p2')
)
)
)

print("Everything")
pairs.show()

print("Comparing parts separately (doesn't give me what I want)")
(pairs
.where(col('p1.city') < col('p2.city'))
.where(col('p1.person') < col('p2.person'))
.show())

print("Comparing parts together with concat (gives me what I want but is 
hacky)")
(pairs
.where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
.show())

print("Comparing parts together with struct (my desired solution but currently 
yields an error)")
(pairs
.where(col('p1') < col('p2'))
.show())
{code}

The last query yields the following error in Spark 2.1.1:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or 
int or bigint or float or double or decimal or timestamp or date or string or 
binary) type, not struct;;
'Filter (p1#5 < p2#8)
+- Join Cross
   :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
   :  +- LogicalRDD [city#0, person#1]
   +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
  +- LogicalRDD [city#0, person#1]
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2017-06-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035062#comment-16035062
 ] 

Nicholas Chammas commented on SPARK-12661:
--

I think we are good to resolve this provided that we've stopped testing with 
Python 2.6. Any cleanup of 2.6-specific workarounds (tracked in SPARK-20149) 
can be done separately IMO.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9862) Join: Handling data skew

2017-05-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020030#comment-16020030
 ] 

Nicholas Chammas commented on SPARK-9862:
-

Is this issue meant to be a SQL-equivalent of SPARK-4644 (as suggested by 
[~rxin] in the comments)?

> Join: Handling data skew
> 
>
> Key: SPARK-9862
> URL: https://issues.apache.org/jira/browse/SPARK-9862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
> Attachments: Handling skew data in join.pdf
>
>
> For a two way shuffle join, if one or multiple groups are skewed in one table 
> (say left table) but having a relative small number of rows in another table 
> (say right table), we can use broadcast join for these skewed groups and use 
> shuffle join for other groups.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19553) Add GroupedData.countApprox()

2017-03-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780
 ] 

Nicholas Chammas edited comment on SPARK-19553 at 3/14/17 2:38 PM:
---

The utility of 1) would be the ability to count items instead of distinct items, 
unless I misunderstood what you're saying. I would imagine that just counting 
items (as opposed to distinct items) would be cheaper, in addition to being 
semantically different.

-I'll open a PR for 3), unless someone else wants to step in and do that.-
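
To make the semantic difference concrete, here is a small sketch on toy data 
contrasting exact per-group row counts (the quantity a countApprox() would 
estimate) with approximate distinct counts:

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1), ('a', 1), ('b', 2)],
    ['col1', 'col2'],
)

# Exact number of rows per group: the quantity a countApprox() would estimate.
df.groupBy('col1').count().show()

# Approximate number of *distinct* col2 values per group: duplicates within a
# group are collapsed first, so group 'a' reports 1 here, not 2.
df.groupBy('col1').agg(approx_count_distinct('col2')).show()
{code}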


was (Author: nchammas):
The utility of 1) would be the ability to count items instead of distinct items, 
unless I misunderstood what you're saying. I would imagine that just counting 
items (as opposed to distinct items) would be cheaper, in addition to being 
semantically different.

I'll open a PR for 3), unless someone else wants to step in and do that.

> Add GroupedData.countApprox()
> -
>
> Key: SPARK-19553
> URL: https://issues.apache.org/jira/browse/SPARK-19553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We already have a 
> [{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct]
>  that can be applied to grouped data, but it seems odd that you can't just 
> get regular approximate count for grouped data.
> I imagine the API would mirror that for 
> [{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox],
>  but I'm not sure:
> {code}
> (df
> .groupBy('col1')
> .countApprox(timeout=300, confidence=0.95)
> .show())
> {code}
> Or, if we want to mirror the {{approx_count_distinct()}} function, we can do 
> that too. I'd want to understand why that function doesn't take a timeout or 
> confidence parameter, though. Also, what does {{rsd}} mean? It's not 
> documented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893703#comment-15893703
 ] 

Nicholas Chammas commented on SPARK-15474:
--

cc [~owen.omalley]

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls to 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON, it works but ORC does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890930#comment-15890930
 ] 

Nicholas Chammas commented on SPARK-19578:
--

Makes sense to me. I suppose the Apache Arrow integration work that is 
currently ongoing (SPARK-13534) will not help RDD.count() since that will only 
benefit DataFrames. (Granted, in this specific example you can always read the 
file using spark.read.csv() or spark.read.text(), which will avoid this problem.)

So it sounds like the "poor PySpark performance" part of this issue is 
"Won't/Can't fix" at this time. The incorrect UI input-size metrics sound like 
a separate issue that should be split out. 
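
For reference, a minimal sketch of that DataFrame-based workaround, using the 
sample file from the reproduce steps below:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the file through the DataFrame reader keeps the count on the JVM
# side, so the rows never round-trip through Python workers the way
# rdd.count() does in the report below.
line_count = spark.read.text("/tmp/yes.txt").count()
print(line_count)
{code}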

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> A simple job in PySpark takes 14 minutes to complete.
> The text file used to reproduce it contains many millions of lines of the one word 
> "yes"
>  (it might be the cause of poor performance)
> {code}
> var a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this  to generate big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 3.5 KB, free 378.6 KB)
> 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on localhost:43389 (size: 3.5 KB, free: 517.4 MB)
> 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1008
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from 
> ResultStage 0 (PythonRDD[2] at count at :1)
> 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
> 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, partition 0,ANY, 2149 bytes)
> 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/02/13 11:13:03 INFO HadoopRDD: Input split: 
> hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728
> 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 17/02/13 11:13:03 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 17/02/13 11:13:03 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 17/02/13 11:16:37 INFO PythonRunner: Times: total = 213335, boot = 317, init 
> = 445, finish = 212573
> 17/02/13 11:16:37 INFO Executor: Finished task 0.0 

[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890639#comment-15890639
 ] 

Nicholas Chammas commented on SPARK-15474:
--

There is a related discussion on ORC-152 which suggests that this is an issue 
with Spark's DataFrame writer for ORC. If there is evidence that this is not 
the case, it would be good to post it directly on ORC-152 so we can get input 
from people on that project.

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls to 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON, it works but ORC does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



<    1   2   3   4   5   6   7   8   9   10   >