[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-26 Thread Ohad Raviv (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523344#comment-16523344
 ] 

Ohad Raviv commented on SPARK-24528:


Hi,

well, it took me some time to get to it, but here are my design conclusions:
 # Currently all the file scans are done with FileScanRDD. In its current 
implementation it gets a list of files in each partition and iterates over 
them one after the other.
 # That means we probably need another FileScanRDD that can "open" all the 
files and iterate over them in a merge-sort manner (e.g. maintaining a heap to 
know which file to read from next) - a sketch follows below this list.
 # The FileScanRDD is created in FileSourceScanExec.createBucketedReadRDD if 
the data is bucketed.
 # FileSourceScanExec is created in FileSourceStrategy.
 # That means we could determine in FileSourceStrategy whether the read output 
is required to be sorted, and percolate this knowledge to the creation of 
the new FileScan(Sorted?)RDD.
 # One thing to note is to enable this sorted reading only if it's required; 
otherwise it will cause a performance issue.
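
To make point 2 concrete, here is a minimal, self-contained sketch of the 
merge-sort style iteration (plain Scala with hypothetical names, not the actual 
FileScanRDD API): each per-file iterator is assumed to already be sorted by the 
bucket's sort key, and a heap always exposes the file whose next row is 
smallest.

{code:scala}
import scala.collection.mutable

// Merge N already-sorted iterators (one per bucket file) into a single sorted
// iterator by keeping a heap of the current head of each file.
def mergeSorted[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
  // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest head first.
  val heap = mutable.PriorityQueue.empty[(T, Iterator[T])](
    Ordering.by[(T, Iterator[T]), T](_._1).reverse)
  iterators.filter(_.hasNext).foreach(it => heap.enqueue((it.next(), it)))

  new Iterator[T] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): T = {
      val (smallest, it) = heap.dequeue()
      if (it.hasNext) heap.enqueue((it.next(), it))  // refill from the file we just consumed
      smallest
    }
  }
}

// mergeSorted(Seq(Iterator(1, 3, 5), Iterator(2, 4, 6))).toList == List(1, 2, 3, 4, 5, 6)
{code}

A FileScan(Sorted?)RDD would do the same over per-file row iterators, ordered by 
the bucket's sort columns.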

please tell me WDYT.

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have: getting the most up-to-date row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24650) GroupingSet

2018-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24650:
-
Priority: Major  (was: Blocker)

> GroupingSet
> ---
>
> Key: SPARK-24650
> URL: https://issues.apache.org/jira/browse/SPARK-24650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: CDH 5.X, Spark 2.3
>Reporter: Mihir Sahu
>Priority: Major
>  Labels: Grouping, Sets
>
> If a grouping set is used in Spark SQL, then the plan does not perform 
> optimally.
> If the input to a grouping set is x rows and the grouping set has y groups, 
> then the number of rows that are processed is currently x*y.
> Example: let a DataFrame have the columns col1, col2, col3 and col4, let its 
> number of rows be rowNos,
> and let the grouping set consist of: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2.
> The number of rows processed in such a case is 3*(rowNos * size of each row).
> However, is this the optimal way of processing the data?
> If the groups of y are derivable from each other, can we reduce the volume 
> processed by removing columns as we progress to lower dimensions of 
> processing?
> Currently, while processing percentiles, a lot of data seems to be processed, 
> causing a performance issue.
> Need to look at whether this can be optimised.
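
The x*y multiplication described above is visible in the physical plan: Spark 
expands each input row once per grouping set (an Expand node) before the 
aggregation runs. A minimal sketch from spark-shell, with hypothetical table and 
column names:

{code:scala}
// Small table with the columns from the example above.
val df = spark.range(1000).selectExpr(
  "id", "id % 2 AS col1", "id % 3 AS col2", "id % 5 AS col3", "id % 7 AS col4")
df.createOrReplaceTempView("t")

// Three grouping sets: the plan contains an Expand node that emits one copy of
// every input row per grouping set (3x here) before aggregating.
spark.sql("""
  SELECT col1, col2, col3, col4, count(id)
  FROM t
  GROUP BY col1, col2, col3, col4
  GROUPING SETS ((col1, col2, col3), (col2, col4), (col1, col2))
""").explain()
{code}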



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24650) GroupingSet

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523351#comment-16523351
 ] 

Hyukjin Kwon commented on SPARK-24650:
--

Please avoid setting Blocker, which is usually reserved for committers.

> GroupingSet
> ---
>
> Key: SPARK-24650
> URL: https://issues.apache.org/jira/browse/SPARK-24650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: CDH 5.X, Spark 2.3
>Reporter: Mihir Sahu
>Priority: Major
>  Labels: Grouping, Sets
>
> If a grouping set is used in Spark SQL, then the plan does not perform 
> optimally.
> If the input to a grouping set is x rows and the grouping set has y groups, 
> then the number of rows that are processed is currently x*y.
> Example: let a DataFrame have the columns col1, col2, col3 and col4, let its 
> number of rows be rowNos,
> and let the grouping set consist of: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2.
> The number of rows processed in such a case is 3*(rowNos * size of each row).
> However, is this the optimal way of processing the data?
> If the groups of y are derivable from each other, can we reduce the volume 
> processed by removing columns as we progress to lower dimensions of 
> processing?
> Currently, while processing percentiles, a lot of data seems to be processed, 
> causing a performance issue.
> Need to look at whether this can be optimised.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24651) Add ability to write null values while writing JSON

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523383#comment-16523383
 ] 

Hyukjin Kwon commented on SPARK-24651:
--

I think it's basically a duplicate of SPARK-23773.

> Add ability to write null values while writing JSON
> ---
>
> Key: SPARK-24651
> URL: https://issues.apache.org/jira/browse/SPARK-24651
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Matthew Liem
>Priority: Minor
>
> Hello,
>  Spark is configured to ignore null values when writing JSON, based on 
> JacksonMessageWriter.scala, which during serialization sets:
> {{mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)}}
> In some scenarios it is useful to keep these fields. Looking to see if this 
> functionality can be added or made configurable, e.g. to use Include.ALWAYS 
> or other inclusion properties depending on the requirement.
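
For reference, the Jackson inclusion setting mentioned above controls exactly 
this; a minimal standalone sketch (hypothetical Record class, not Spark's actual 
writer) comparing NON_NULL with ALWAYS:

{code:scala}
import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.databind.ObjectMapper

// Hypothetical bean with one null property, only to show the two inclusion modes.
class Record {
  def getA: String = "1"
  def getB: String = null
}

object NullInclusionDemo extends App {
  val dropNulls = new ObjectMapper()
  dropNulls.setSerializationInclusion(JsonInclude.Include.NON_NULL)
  println(dropNulls.writeValueAsString(new Record))  // b is omitted: {"a":"1"}

  val keepNulls = new ObjectMapper()
  keepNulls.setSerializationInclusion(JsonInclude.Include.ALWAYS)
  println(keepNulls.writeValueAsString(new Record))  // b is kept, e.g. {"a":"1","b":null}
}
{code}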



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24649) SparkUDF.unapply is not backwards compatible

2018-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24649.
--
Resolution: Invalid

Catalyst is considered an internal API, and is subject to change between minor 
releases.

> SparkUDF.unapply is not backwards compatible
> 
>
> Key: SPARK-24649
> URL: https://issues.apache.org/jira/browse/SPARK-24649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Simeon H.K. Fitch
>Priority: Minor
>
> The shape of the `ScalaUDF` case class changed in 2.3.0.  A secondary 
> constructor that's backwards compatible with 2.1.x and 2.2.x was provided, 
> but a corresponding `unapply` method wasn't included. Therefore code such as 
> the following that worked in 2.1.x and 2.2.x no longer compiles:
> {code:java}
> val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
> {code}
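
Until an unapply matching the new shape exists, one workaround is to match on 
the class and read the fields directly instead of destructuring. A sketch only, 
assuming the 2.3.x ScalaUDF still exposes these constructor parameters as vals; 
since it is a Catalyst-internal class, this can also change between releases:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUDF}

// Instead of `val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF`,
// match on the type and access the fields off the matched instance.
def describeScalaUdf(expr: Expression): Option[String] = expr match {
  case udf: ScalaUDF =>
    Some(s"udf=${udf.udfName.getOrElse("<anonymous>")}, " +
         s"arity=${udf.children.size}, returns=${udf.dataType}")
  case _ => None
}
{code}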



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523399#comment-16523399
 ] 

Hyukjin Kwon commented on SPARK-24644:
--

Can you clarify the environment, in particular, PyArrow and Pandas versions?

> Pyarrow exception while running pandas_udf on pyspark 2.3.1
> ---
>
> Key: SPARK-24644
> URL: https://issues.apache.org/jira/browse/SPARK-24644
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
> Environment: os: centos
> pyspark 2.3.1
> spark 2.3.1
> pyarrow >= 0.8.0
>Reporter: Hichame El Khalfi
>Priority: Major
>
> Hello,
> When I try to run a `pandas_udf` on my spark dataframe, I get this error
>  
> {code:java}
>   File 
> "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py",
>  lin
> e 280, in load_stream
> pdf = batch.to_pandas()
>   File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226)
> return Table.from_batches([self]).to_pandas(nthreads=nthreads)
>   File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331)
> mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 528, in table_to_blockmanager
> blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 622, in _table_to_blocks
> return [_reconstruct_block(item) for item in result]
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 446, in _reconstruct_block
> block = _int.make_block(block_arr, placement=placement)
> TypeError: make_block() takes at least 3 arguments (2 given)
> {code}
>  
>  More than happy to provide any additional information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24643) from_json should accept an aggregate function as schema

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523403#comment-16523403
 ] 

Hyukjin Kwon commented on SPARK-24643:
--

SPARK-24642 is not added yet though ... 

> from_json should accept an aggregate function as schema
> ---
>
> Key: SPARK-24643
> URL: https://issues.apache.org/jira/browse/SPARK-24643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the *from_json()* function accepts only string literals as schema:
>  - Checking of schema argument inside of JsonToStructs: 
> [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L530]
>  - Accepting only string literal: 
> [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L749-L752]
> JsonToStructs should be modified to accept results of aggregate functions 
> like *infer_schema* (see SPARK-24642). It should be possible to write SQL 
> like:
> {code:sql}
> select from_json(json_col, infer_schema(json_col)) from json_table
> {code}
> Here is a test case with existing aggregate function - *first()*:
> {code:sql}
> create temporary view schemas(schema) as select * from values
>   ('struct<a:int>'),
>   ('map<string,int>');
> select from_json('{"a":1}', first(schema)) from schemas;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24647) Sink Should Return OffsetSeqs For ProgressReporting

2018-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24647:
-
Fix Version/s: (was: 2.4.0)

> Sink Should Return OffsetSeqs For ProgressReporting
> ---
>
> Key: SPARK-24647
> URL: https://issues.apache.org/jira/browse/SPARK-24647
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Priority: Major
>
> To be able to track data lineage for Structured Streaming (I intend to 
> implement this in the open source project Spline), the monitoring needs to be 
> able to track not only where the data was read from but also where the results 
> were written to. To my knowledge, this could best be implemented by monitoring 
> {{StreamingQueryProgress}}. However, batch data offsets are currently not 
> available on the {{Sink}} interface. Implementing this as proposed would also 
> bring symmetry to the {{StreamingQueryProgress}} fields sources and sink.
>  
> *Similar Proposals*
> Made in following jiras. These would not be sufficient for lineage tracking.
>  * https://issues.apache.org/jira/browse/SPARK-18258
>  * https://issues.apache.org/jira/browse/SPARK-21313
>  
> *Current State*
>  * Method {{Sink#addBatch}} returns {{Unit}}.
>  * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using 
> {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method.
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f"
>   }
> {code}
>  
>  
> *Proposed State*
>  * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying 
> offsets of the written batch, e.g. Kafka does it by returning 
> {{RecordMetadata}} object from {{send}} method.
>  * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion 
> as {{sourceProgress}}.
>  
>  
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f",
>    "startOffset" : null,
>     "endOffset" { "sinkTopic": { "0": 333 }}
>   }
> {code}
>  
> *Implementation*
> * PR submitters: Likely will be me and [~wajda] as soon as the discussion 
> ends positively. 
>  * {{Sinks}}: Modify all sinks to conform to a new interface or return dummy 
> values.
>  * {{ProgressReporter}}: Merge offsets from different batches properly, 
> similarly to how it is done for sources.
>  
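
For illustration, the "Proposed State" above amounts to a signature change 
roughly like the following sketch; the real trait is 
org.apache.spark.sql.execution.streaming.Sink, whose addBatch currently returns 
Unit, and the exact return type is what the discussion should settle:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.OffsetSeq

// Sketch of the proposed interface, not the actual Spark API.
trait SinkWithProgress {
  // Return the offsets that were actually committed for this batch, so that
  // ProgressReporter can fill sinkProgress the same way it fills sourceProgress.
  def addBatch(batchId: Long, data: DataFrame): OffsetSeq
}
{code}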



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24647) Sink Should Return OffsetSeqs For ProgressReporting

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523405#comment-16523405
 ] 

Hyukjin Kwon commented on SPARK-24647:
--

(Please avoid setting a fix version, which is usually set when the issue is 
actually fixed.)

> Sink Should Return OffsetSeqs For ProgressReporting
> ---
>
> Key: SPARK-24647
> URL: https://issues.apache.org/jira/browse/SPARK-24647
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Priority: Major
>
> To be able to track data lineage for Structured Streaming (I intend to 
> implement this in the open source project Spline), the monitoring needs to be 
> able to track not only where the data was read from but also where the results 
> were written to. To my knowledge, this could best be implemented by monitoring 
> {{StreamingQueryProgress}}. However, batch data offsets are currently not 
> available on the {{Sink}} interface. Implementing this as proposed would also 
> bring symmetry to the {{StreamingQueryProgress}} fields sources and sink.
>  
> *Similar Proposals*
> Made in following jiras. These would not be sufficient for lineage tracking.
>  * https://issues.apache.org/jira/browse/SPARK-18258
>  * https://issues.apache.org/jira/browse/SPARK-21313
>  
> *Current State*
>  * Method {{Sink#addBatch}} returns {{Unit}}.
>  * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using 
> {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method.
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f"
>   }
> {code}
>  
>  
> *Proposed State*
>  * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying 
> offsets of the written batch, e.g. Kafka does it by returning 
> {{RecordMetadata}} object from {{send}} method.
>  * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion 
> as {{sourceProgress}}.
>  
>  
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f",
>    "startOffset" : null,
>     "endOffset" { "sinkTopic": { "0": 333 }}
>   }
> {code}
>  
> *Implementation*
> * PR submitters: Likely will be me and [~wajda] as soon as the discussion 
> ends positively. 
>  * {{Sinks}}: Modify all sinks to conform to a new interface or return dummy 
> values.
>  * {{ProgressReporter}}: Merge offsets from different batches properly, 
> similarly to how it is done for sources.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523417#comment-16523417
 ] 

Hyukjin Kwon commented on SPARK-24570:
--

So you are saying

{code}
== SQL ==
SHOW TABLE EXTENDED FROM sit1_pb LIKE `*`
--^^^
{code}

doesn't work in Spark?
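
If so, the parse error suggests the LIKE pattern must be a string literal rather 
than a backquoted identifier, so the statement the tool generates fails while a 
quoted pattern should parse; a quick check from spark-shell, assuming the 
sit1_pb database exists:

{code:scala}
// Backticks make `*` an identifier, which the grammar rejects ("expecting STRING").
// A single-quoted pattern is what SHOW TABLE EXTENDED expects:
spark.sql("SHOW TABLE EXTENDED FROM sit1_pb LIKE '*'").show(truncate = false)
{code}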

> SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel 
> SQL, DBVisualizer.etc)
> ---
>
> Key: SPARK-24570
> URL: https://issues.apache.org/jira/browse/SPARK-24570
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: t oo
>Priority: Major
> Attachments: connect-to-sql-db-ssms-locate-table.png
>
>
> An end-user SQL client tool (i.e. the one in the screenshot) can list tables 
> from HiveServer2 and major DBs (MySQL, Postgres, Oracle, MSSQL, etc.). But with 
> SparkSQL it does not display any tables. This would be very convenient for 
> users.
> This is the exception in the client tool (Aqua Data Studio):
> {code:java}
> Title: An Error Occurred
> Summary: Unable to Enumerate Result
>  Start Message 
> 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '`*`' expecting STRING(line 1, pos 38)
> == SQL ==
> SHOW TABLE EXTENDED FROM sit1_pb LIKE `*`
> --^^^
>  End Message 
> 
>  Start Stack Trace 
> 
> java.sql.SQLException: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '`*`' expecting STRING(line 1, pos 38)
> == SQL ==
> SHOW TABLE EXTENDED FROM sit1_pb LIKE `*`
> --^^^
>   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
>   at com.aquafold.aquacore.open.rdbms.drivers.hive.Qꐨꈬꈦꁐ.execute(Unknown 
> Source)
>   at \\.\\.\\हिñçêČάй語简�?한\\.gᚵ᠃᠍ꃰint.execute(Unknown Source)
>   at com.common.ui.tree.hꐊᠱꇗꇐ9int.yW(Unknown Source)
>   at com.common.ui.tree.hꐊᠱꇗꇐ9int$1.process(Unknown Source)
>   at com.common.ui.util.BackgroundThread.run(Unknown Source)
>  End Stack Trace 
> 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523431#comment-16523431
 ] 

Hyukjin Kwon commented on SPARK-24530:
--

macOS, Python 2.7.14, Sphinx 1.4.1 shows:

{code}
class pyspark.ml.classification.LogisticRegression(*args, **kwargs)[source]
Logistic regression. This class supports multinomial logistic (softmax) and 
binomial logistic regression.
{code}


> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523433#comment-16523433
 ] 

Hyukjin Kwon commented on SPARK-24530:
--

I have another computer: macOS, Python 2.7.14, Sphinx 1.7.2 shows:

{code}
class pyspark.ml.classification.LogisticRegression(*args, **kwargs)[source]
Logistic regression. This class supports multinomial logistic (softmax) and 
binomial logistic regression.
{code}

I think we need [~dongjoon]'s input.

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523443#comment-16523443
 ] 

Hyukjin Kwon commented on SPARK-24458:
--

I usually just check out the tag, for example {{git checkout v2.3.1}}, and 
build it. But frankly, I usually just download the minor versions and check 
those, because a maintenance release usually doesn't contain big changes.

> Invalid PythonUDF check_1(), requires attributes from more than one child
> -
>
> Key: SPARK-24458
> URL: https://issues.apache.org/jira/browse/SPARK-24458
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0 (local mode)
> Mac OSX
>Reporter: Abdeali Kothari
>Priority: Major
>
> I was trying out a very large query execution plan I have and I got the error:
>  
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o359.simpleString.
> : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires 
> attributes from more than one child.
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181)
>  at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
>  at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMeth

[jira] [Assigned] (SPARK-22425) add output files information to EventLogger

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22425:


Assignee: Apache Spark

> add output files information to EventLogger
> ---
>
> Key: SPARK-22425
> URL: https://issues.apache.org/jira/browse/SPARK-22425
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Long Tian
>Assignee: Apache Spark
>Priority: Major
>  Labels: patch
>
> We can get all the input files from the *EventLogger* when 
> *spark.eventLog.enabled* is *true*, but there is no information about output 
> files. Is it possible to add output file information to the *EventLogger*? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22425) add output files information to EventLogger

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523459#comment-16523459
 ] 

Apache Spark commented on SPARK-22425:
--

User 'voidfunction' has created a pull request for this issue:
https://github.com/apache/spark/pull/21642

> add output files information to EventLogger
> ---
>
> Key: SPARK-22425
> URL: https://issues.apache.org/jira/browse/SPARK-22425
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Long Tian
>Priority: Major
>  Labels: patch
>
> We can get all the input files from the *EventLogger* when 
> *spark.eventLog.enabled* is *true*, but there is no information about output 
> files. Is it possible to add output file information to the *EventLogger*? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22425) add output files information to EventLogger

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22425:


Assignee: (was: Apache Spark)

> add output files information to EventLogger
> ---
>
> Key: SPARK-22425
> URL: https://issues.apache.org/jira/browse/SPARK-22425
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Long Tian
>Priority: Major
>  Labels: patch
>
> We can get all the input files from the *EventLogger* when 
> *spark.eventLog.enabled* is *true*, but there is no information about output 
> files. Is it possible to add output file information to the *EventLogger*? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child

2018-06-26 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523475#comment-16523475
 ] 

Ruben Berenguel commented on SPARK-24458:
-

Oh, big facepalm, thanks [~hyukjin.kwon]. My autocomplete script completes 
branches but not tags, so it could not find 2.3.0, and I've been using it for so 
long that I expect it to always work. I need to fix that :D. I'll check 
reproducibility and try to find the change that fixed this, for completeness.

> Invalid PythonUDF check_1(), requires attributes from more than one child
> -
>
> Key: SPARK-24458
> URL: https://issues.apache.org/jira/browse/SPARK-24458
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0 (local mode)
> Mac OSX
>Reporter: Abdeali Kothari
>Priority: Major
>
> I was trying out a very large query execution plan I have and I got the error:
>  
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o359.simpleString.
> : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires 
> attributes from more than one child.
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181)
>  at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
>  at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187)
>  at sun.reflec

[jira] [Commented] (SPARK-24347) df.alias() in python API should not clear metadata by default

2018-06-26 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523476#comment-16523476
 ] 

Ruben Berenguel commented on SPARK-24347:
-

Pinging [~hyukjin.kwon], too :)

> df.alias() in python API should not clear metadata by default
> -
>
> Key: SPARK-24347
> URL: https://issues.apache.org/jira/browse/SPARK-24347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> currently when doing an alias on a column in pyspark I lose metadata:
> {code:java}
> print("just select = ", df.select(col("v")).schema.fields[0].metadata.keys())
> print("select alias= ", 
> df.select(col("v").alias("vv")).schema.fields[0].metadata.keys()){code}
> gives:
> {code:java}
> just select =  dict_keys(['ml_attr'])
> select alias=  dict_keys([]){code}
> After looking at the alias() documentation I see that metadata is an optional 
> param, but it should not clear the metadata when it is not set. A sensible 
> default would be to keep it as-is.
> Otherwise it generates problems later in the processing pipeline when someone 
> depends on the metadata.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files

2018-06-26 Thread Andrei Gorlanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523479#comment-16523479
 ] 

Andrei Gorlanov commented on SPARK-18649:
-

Hello, I am going to take care of it.

> sc.textFile(my_file).collect() raises socket.timeout on large files
> ---
>
> Key: SPARK-18649
> URL: https://issues.apache.org/jira/browse/SPARK-18649
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: PySpark version 1.6.2
>Reporter: Erik Cederstrand
>Priority: Major
>
> I'm trying to load a file into the driver with this code:
> contents = sc.textFile('hdfs://path/to/big_file.csv').collect()
> Loading into the driver instead of creating a distributed RDD is intentional 
> in this case. The file is ca. 6GB, and I have adjusted driver memory 
> accordingly to fit the local data. After some time, my spark-submitted job 
> crashes with the stack trace below.
> I have traced this to pyspark/rdd.py where the _load_from_socket() method 
> creates a socket with a hard-coded timeout of 3 seconds (this code is also 
> present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value 
> to e.g. 600 lets me read the entire file.
> Is there any reason that this value does not use e.g. the 
> 'spark.network.timeout' setting instead?
> Traceback (most recent call last):
>   File "my_textfile_test.py", line 119, in 
> contents = sc.textFile('hdfs://path/to/file.csv').collect()
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 772, in collect
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 142, in _load_from_socket
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 517, in load_stream
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 511, in loads
>   File "/usr/lib/python2.7/socket.py", line 380, in read
> data = self._sock.recv(left)
> socket.timeout: timed out
> 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
>   Suppressed: java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at 
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>   ... 3 more
> 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.a

[jira] [Created] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Kris Mok (JIRA)
Kris Mok created SPARK-24659:


 Summary: GenericArrayData.equals should respect element type 
differences
 Key: SPARK-24659
 URL: https://issues.apache.org/jira/browse/SPARK-24659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0, 2.4.0
Reporter: Kris Mok


Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
element type differences, due to a caveat in Scala's {{==}} operator.

e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
against the semantics of Spark SQL's array type, where {{array<int>}} and 
{{array<bigint>}} are considered to be incompatible types and thus should never 
be equal.

This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
so that it's more aligned to Spark SQL's array type semantics.
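
The {{==}} caveat referred to above is Scala's "cooperative equality" for boxed 
numeric values; a small illustration of why element-wise comparison alone cannot 
distinguish the element types:

{code:scala}
// A boxed Int and a boxed Long with the same numeric value compare equal under
// Scala's ==, and collection equality is element-wise, so it inherits this.
val i: Any = 123
val l: Any = 123L
println(i == l)                 // true
println(Seq(123) == Seq(123L))  // true, even though the element types differ
{code}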



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523578#comment-16523578
 ] 

Apache Spark commented on SPARK-24659:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/21643

> GenericArrayData.equals should respect element type differences
> ---
>
> Key: SPARK-24659
> URL: https://issues.apache.org/jira/browse/SPARK-24659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Kris Mok
>Priority: Major
>
> Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
> element type differences, due to a caveat in Scala's {{==}} operator.
> e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
> GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
> against the semantics of Spark SQL's array type, where {{array<int>}} and 
> {{array<bigint>}} are considered to be incompatible types and thus should never 
> be equal.
> This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
> so that it's more aligned to Spark SQL's array type semantics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24659:


Assignee: (was: Apache Spark)

> GenericArrayData.equals should respect element type differences
> ---
>
> Key: SPARK-24659
> URL: https://issues.apache.org/jira/browse/SPARK-24659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Kris Mok
>Priority: Major
>
> Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
> element type differences, due to a caveat in Scala's {{==}} operator.
> e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
> GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
> against the semantics of Spark SQL's array type, where {{array<int>}} and 
> {{array<bigint>}} are considered to be incompatible types and thus should never 
> be equal.
> This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
> so that it's more aligned to Spark SQL's array type semantics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24659:


Assignee: Apache Spark

> GenericArrayData.equals should respect element type differences
> ---
>
> Key: SPARK-24659
> URL: https://issues.apache.org/jira/browse/SPARK-24659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Major
>
> Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
> element type differences, due to a caveat in Scala's {{==}} operator.
> e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
> GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
> against the semantics of Spark SQL's array type, where {{array}} and 
> {{array}} are considered to be incompatible types and thus should never 
> be equal.
> This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
> so that it's more aligned to Spark SQL's array type semantics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24660) SHS is not showing properly errors when downloading logs

2018-06-26 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24660:
---

 Summary: SHS is not showing properly errors when downloading logs
 Key: SPARK-24660
 URL: https://issues.apache.org/jira/browse/SPARK-24660
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Marco Gaido


The History Server is not properly showing errors which happen when trying to 
download logs. In particular, when downloading logs for which the user is not 
authorized, the user sees a File not found error, instead of the unauthorized 
response.

Similarly, trying to download logs from a non-existing application returns a 
server error, instead of a 404 message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24660) SHS is not showing properly errors when downloading logs

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24660:


Assignee: (was: Apache Spark)

> SHS is not showing properly errors when downloading logs
> 
>
> Key: SPARK-24660
> URL: https://issues.apache.org/jira/browse/SPARK-24660
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server is not properly showing errors which happen when trying to 
> download logs. In particular, when downloading logs for which the user is not 
> authorized, the user sees a File not found error, instead of the unauthorized 
> response.
> Similarly, trying to download logs from a non-existing application returns a 
> server error, instead of a 404 message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24660) SHS is not showing properly errors when downloading logs

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523788#comment-16523788
 ] 

Apache Spark commented on SPARK-24660:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21644

> SHS is not showing properly errors when downloading logs
> 
>
> Key: SPARK-24660
> URL: https://issues.apache.org/jira/browse/SPARK-24660
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server is not properly showing errors which happen when trying to 
> download logs. In particular, when downloading logs for which the user is not 
> authorized, the user sees a File not found error, instead of the unauthorized 
> response.
> Similarly, trying to download logs from a non-existing application returns a 
> server error, instead of a 404 message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24660) SHS is not showing properly errors when downloading logs

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24660:


Assignee: Apache Spark

> SHS is not showing properly errors when downloading logs
> 
>
> Key: SPARK-24660
> URL: https://issues.apache.org/jira/browse/SPARK-24660
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> The History Server does not properly report errors that happen when trying to
> download logs. In particular, when downloading logs the user is not authorized
> to access, the user sees a "File not found" error instead of an unauthorized
> response.
> Similarly, trying to download logs for a non-existing application returns a
> server error instead of a 404.






[jira] [Created] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeException

2018-06-26 Thread David Mavashev (JIRA)
David Mavashev created SPARK-24661:
--

 Summary: Window API - using multiple fields for partitioning with 
WindowSpec API and dataset that is cached causes 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException
 Key: SPARK-24661
 URL: https://issues.apache.org/jira/browse/SPARK-24661
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Java API, PySpark
Affects Versions: 2.3.0
Reporter: David Mavashev


Steps to reproduce:

Creating a data set:

 
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import scala.collection.JavaConverters;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

List<String> simpleWindowColumns = new ArrayList<>();
simpleWindowColumns.add("column1");
simpleWindowColumns.add("column2");

// expression -> alias
Map<String, String> expressionsWithAliases = new HashMap<>();
expressionsWithAliases.put("count(id)", "count_column");

DataFrameReader reader = sparkSession.read().format("csv");

// Invoking cache; the reference is kept final so the lambdas below can capture it.
final Dataset<Row> sparkDataSet = reader.option("header", "true")
    .load("/path/to/data/data.csv")
    .cache();

// Creating window spec with 2 columns:
final WindowSpec window = Window.partitionBy(
    JavaConverters.asScalaIteratorConverter(
        simpleWindowColumns.stream().map(sparkDataSet::col).iterator()).asScala().toSeq());

Dataset<Row> withWindowColumns = sparkDataSet.withColumns(
    JavaConverters.asScalaIteratorConverter(
        expressionsWithAliases.entrySet().stream().map(Map.Entry::getKey)
            .collect(Collectors.toList()).iterator()).asScala().toSeq(),
    JavaConverters.asScalaIteratorConverter(
        expressionsWithAliases.entrySet().stream()
            .map(entry -> new Column(entry.getValue()).over(window))
            .collect(Collectors.toList()).iterator()).asScala().toSeq());

withWindowColumns.show();{code}
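
For reference, a simplified Scala sketch of the same scenario, assuming a
SparkSession named spark and the same hypothetical CSV with columns column1,
column2 and id (not verified to reproduce the exact exception):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Cache the input, then add a windowed aggregate partitioned by two columns.
val df = spark.read.option("header", "true").csv("/path/to/data/data.csv").cache()
val w = Window.partitionBy(col("column1"), col("column2"))
val result = df.withColumn("count_column", count(col("id")).over(w))
result.show()
{code}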
Expected: results are shown.

Actual: the following exception is thrown
{code:java}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: 
windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, 
unboundedpreceding$(), unboundedfollowing$())) at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
scala.collection.immutable.List.map(List.scala:285) at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:226)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPl

[jira] [Updated] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeException

2018-06-26 Thread David Mavashev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mavashev updated SPARK-24661:
---
Description: 
Steps to reproduce:

Creating a data set:

 
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import scala.collection.JavaConverters;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

List<String> simpleWindowColumns = new ArrayList<>();
simpleWindowColumns.add("column1");
simpleWindowColumns.add("column2");

// expression -> alias
Map<String, String> expressionsWithAliases = new HashMap<>();
expressionsWithAliases.put("count(id)", "count_column");

DataFrameReader reader = sparkSession.read().format("csv");

// Invoking cache; the reference is kept final so the lambdas below can capture it.
final Dataset<Row> sparkDataSet = reader.option("header", "true")
    .load("/path/to/data/data.csv")
    .cache();

// Creating window spec with 2 columns:
final WindowSpec window = Window.partitionBy(
    JavaConverters.asScalaIteratorConverter(
        simpleWindowColumns.stream().map(sparkDataSet::col).iterator()).asScala().toSeq());

Dataset<Row> withWindowColumns = sparkDataSet.withColumns(
    JavaConverters.asScalaIteratorConverter(
        expressionsWithAliases.entrySet().stream().map(Map.Entry::getKey)
            .collect(Collectors.toList()).iterator()).asScala().toSeq(),
    JavaConverters.asScalaIteratorConverter(
        expressionsWithAliases.entrySet().stream()
            .map(entry -> new Column(entry.getValue()).over(window))
            .collect(Collectors.toList()).iterator()).asScala().toSeq());

withWindowColumns.show();{code}
Expected: results are shown.

Actual: the following exception is thrown
{code:java}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: 
windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, 
unboundedpreceding$(), unboundedfollowing$())) at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
scala.collection.immutable.List.map(List.scala:285) at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:226)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
scala.collection.TraversableLike$$anonfun$ma

[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-06-26 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523888#comment-16523888
 ] 

Hari Sekhon commented on SPARK-6305:


Log4j 2.x would really help with Spark logging integration to ELK, as there are
a lot of things that just don't work properly in Log4j 1.x, such as
layout.ConversionPattern for constructing JSON-enriched logs (for example,
logging user and app names to distinguish jobs and provide much-needed search
usability). ConversionPattern is simply ignored by the SocketAppender in Log4j
1.x, while the SyslogAppender respects it but then splits Java exceptions into
multiple syslog messages, so the JSON no longer parses and routes to the right
indices for the Yarn queue; nor can you reassemble the exception logs with a
multiline codec at the other end, since you'd end up with corrupted input
streams from multiple loggers.

Running Filebeat everywhere instead seems like overkill compared to being able
to enable logging for debugging jobs on an ad-hoc basis to a Logstash sink,
which works much better with the Log4j 2.x output appenders.

I hope someone finally manages to sort this out, as it's years overdue: Log4j
1.x reached end of life 3 years ago, and there is a big jump in capabilities
between Log4j 1.x and 2.x, both in the number of appenders and in the
completeness of even the old appenders, such as the SocketAppender mentioned
above.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.






[jira] [Comment Edited] (SPARK-6305) Add support for log4j 2.x to Spark

2018-06-26 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523888#comment-16523888
 ] 

Hari Sekhon edited comment on SPARK-6305 at 6/26/18 3:47 PM:
-

Log4j 2.x would really help with Spark logging integration to ELK, as there are
a lot of things that just don't work properly in Log4j 1.x, such as
layout.ConversionPattern for constructing JSON-enriched logs (for example,
logging user and app names to distinguish jobs and provide much-needed search
usability). ConversionPattern is simply ignored by the SocketAppender in Log4j
1.x, while the SyslogAppender respects it but then splits Java exceptions into
multiple syslog messages, so the JSON no longer parses and routes to the right
indices for the Yarn queue; nor can you reassemble the exception logs with a
multiline codec at the other end, since you'd end up with corrupted input
streams from multiple loggers.

Running Filebeat everywhere instead seems like overkill compared to being able
to enable logging for debugging jobs on an ad-hoc basis to a Logstash sink,
which works much better with the Log4j 2.x output appenders.

I hope someone finally manages to sort this out, as it's years overdue: Log4j
1.x reached end of life 3 years ago, and there is a big jump in capabilities
between Log4j 1.x and 2.x, both in the number of appenders and in the
completeness of even the old appenders, such as the SocketAppender mentioned
above.


was (Author: harisekhon):
Log4j 2.x would really help with Spark logging integration to ELK as there are 
a lot of things that just don't work properly in Log4j 1.x like 
layout.ConversionPattern for constructing JSON enriched logs, such as logging 
user and app names to distinguish jobs and provide much needed search 
usability. This is simply ignored in the SocketAppender in Log4j 1.x :-/ while 
SyslogAppender respects ConversionPattern but then splits all Java Exceptions 
in to multiple syslog logs so the JSON no longer parses and routes to the right 
indices for the Yarn queue, nor can you reassemble the exception logs using 
multiline codec at the other end as you'd end up with corrupted input streams 
from multiple loggers) :-/

Running Filebeats everywhere instead seems like overkill compared to being able 
to enable logging for debugging jobs on an ad-hoc basis to a Logstash sink that 
works using much better Log4j 2.x output appenders.

I hope someone finally manages to sort this out as it's years overdue given 
Log4j 1.x was end of life 3 years ago and there is a big jump in capabilities 
between Log4j 1.x and 2.x, both in the number of appenders as well as 
completeness of even the old appenders such as the SocketAppender as mentioned 
above.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.






[jira] [Assigned] (SPARK-24653) Flaky test "JoinSuite.test SortMergeJoin (with spill)"

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24653:


Assignee: (was: Apache Spark)

> Flaky test "JoinSuite.test SortMergeJoin (with spill)"
> --
>
> Key: SPARK-24653
> URL: https://issues.apache.org/jira/browse/SPARK-24653
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We've run into failures in this test in our internal jobs a few times. They 
> look like this:
> {noformat}
> java.lang.AssertionError: assertion failed: expected full outer join to not 
> spill, but did
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.TestUtils$.assertNotSpilled(TestUtils.scala:189)
>   at 
> org.apache.spark.sql.JoinSuite$$anonfun$23$$anonfun$apply$mcV$sp$16.apply$mcV$sp(JoinSuite.scala:734)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:108)
> {noformat}
> I looked on the riselab jenkins and couldn't find a failure, so filing with a 
> low priority.
> I did notice a possible race in the code that could explain the failure. Will 
> send a PR.






[jira] [Assigned] (SPARK-24653) Flaky test "JoinSuite.test SortMergeJoin (with spill)"

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24653:


Assignee: Apache Spark

> Flaky test "JoinSuite.test SortMergeJoin (with spill)"
> --
>
> Key: SPARK-24653
> URL: https://issues.apache.org/jira/browse/SPARK-24653
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> We've run into failures in this test in our internal jobs a few times. They 
> look like this:
> {noformat}
> java.lang.AssertionError: assertion failed: expected full outer join to not 
> spill, but did
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.TestUtils$.assertNotSpilled(TestUtils.scala:189)
>   at 
> org.apache.spark.sql.JoinSuite$$anonfun$23$$anonfun$apply$mcV$sp$16.apply$mcV$sp(JoinSuite.scala:734)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:108)
> {noformat}
> I looked on the riselab jenkins and couldn't find a failure, so filing with a 
> low priority.
> I did notice a possible race in the code that could explain the failure. Will 
> send a PR.






[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-26 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523918#comment-16523918
 ] 

Marcelo Vanzin commented on SPARK-24631:


Sorry for the noise, pasted the wrong bug number in my PR.

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but
> still I am getting the error.
> I am getting this error only on the production cluster and only for a single
> table; other tables are running fine.
> + more data:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, but with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred:
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has a Bigint datatype, but there were other tables with
> Bigint columns that ran fine.
>  
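
For context, this AnalysisException is most commonly raised when a wider column
is mapped onto a narrower field of a typed Dataset; whether that is what happens
on the reporter's cluster is not confirmed. A minimal sketch, assuming a
SparkSession named spark and a table testtable(name string, id bigint):
{code}
// spark-shell style snippet
import spark.implicits._

case class Rec(name: String, id: Short) // Short corresponds to smallint

// Encoding the bigint column into the Short field needs an up cast,
// which the analyzer rejects because it may truncate.
val ds = spark.sql("select name, id from testtable").as[Rec]
{code}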






[jira] [Assigned] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24631:
--

Assignee: (was: Marcelo Vanzin)

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but
> still I am getting the error.
> I am getting this error only on the production cluster and only for a single
> table; other tables are running fine.
> + more data:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, but with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred:
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has a Bigint datatype, but there were other tables with
> Bigint columns that ran fine.
>  






[jira] [Assigned] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24631:
--

Assignee: Marcelo Vanzin

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but
> still I am getting the error.
> I am getting this error only on the production cluster and only for a single
> table; other tables are running fine.
> + more data:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, but with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred:
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has a Bigint datatype, but there were other tables with
> Bigint columns that ran fine.
>  






[jira] [Commented] (SPARK-24653) Flaky test "JoinSuite.test SortMergeJoin (with spill)"

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523916#comment-16523916
 ] 

Apache Spark commented on SPARK-24653:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21639

> Flaky test "JoinSuite.test SortMergeJoin (with spill)"
> --
>
> Key: SPARK-24653
> URL: https://issues.apache.org/jira/browse/SPARK-24653
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We've run into failures in this test in our internal jobs a few times. They 
> look like this:
> {noformat}
> java.lang.AssertionError: assertion failed: expected full outer join to not 
> spill, but did
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.TestUtils$.assertNotSpilled(TestUtils.scala:189)
>   at 
> org.apache.spark.sql.JoinSuite$$anonfun$23$$anonfun$apply$mcV$sp$16.apply$mcV$sp(JoinSuite.scala:734)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:108)
> {noformat}
> I looked on the riselab jenkins and couldn't find a failure, so filing with a 
> low priority.
> I did notice a possible race in the code that could explain the failure. Will 
> send a PR.






[jira] [Issue Comment Deleted] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24631:
---
Comment: was deleted

(was: User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21639)

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but
> still I am getting the error.
> I am getting this error only on the production cluster and only for a single
> table; other tables are running fine.
> + more data:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, but with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred:
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has a Bigint datatype, but there were other tables with
> Bigint columns that ran fine.
>  






[jira] [Assigned] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24537:


Assignee: (was: Apache Spark)

> Add array_remove / array_zip / map_from_arrays / array_distinct
> ---
>
> Key: SPARK-24537
> URL: https://issues.apache.org/jira/browse/SPARK-24537
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * array_remove   -SPARK-23920-
>  * array_zip   -SPARK-23931-
>  * map_from_arrays   -SPARK-23933-
>  * array_distinct   -SPARK-23912-






[jira] [Assigned] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24537:


Assignee: Apache Spark

> Add array_remove / array_zip / map_from_arrays / array_distinct
> ---
>
> Key: SPARK-24537
> URL: https://issues.apache.org/jira/browse/SPARK-24537
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add R versions of 
>  * array_remove   -SPARK-23920-
>  * array_zip   -SPARK-23931-
>  * map_from_arrays   -SPARK-23933-
>  * array_distinct   -SPARK-23912-






[jira] [Commented] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524114#comment-16524114
 ] 

Apache Spark commented on SPARK-24537:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21645

> Add array_remove / array_zip / map_from_arrays / array_distinct
> ---
>
> Key: SPARK-24537
> URL: https://issues.apache.org/jira/browse/SPARK-24537
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * array_remove   -SPARK-23920-
>  * array_zip   -SPARK-23931-
>  * map_from_arrays   -SPARK-23933-
>  * array_distinct   -SPARK-23912-






[jira] [Resolved] (SPARK-24658) Remove workaround for ANTLR bug

2018-06-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24658.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.4.0

> Remove workaround for ANTLR bug
> ---
>
> Key: SPARK-24658
> URL: https://issues.apache.org/jira/browse/SPARK-24658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Critical
> Fix For: 2.4.0
>
>
> Issue [antlr/antlr4#781|https://github.com/antlr/antlr4/issues/781] has 
> already been fixed, so the workaround of extracting the pattern into a 
> separate rule is no longer needed. Presto has already removed it:
> https://github.com/prestodb/presto/pull/10744.






[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524233#comment-16524233
 ] 

Dongjoon Hyun commented on SPARK-24530:
---

[~mengxr] and [~hyukjin.kwon]. My environment is macOS, *python 3*, Sphinx 
v1.6.3.  
{code}
~/s/p/docs:master$ make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.6.3
making output directory...
...
{code}

According to the reports above, many combinations of Python 2.7 and Sphinx
look broken.

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the
> generated html doc doesn't render class docs correctly. I attached the
> screenshot from the Spark 2.3 docs and the master docs generated on my local
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Comment Edited] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524233#comment-16524233
 ] 

Dongjoon Hyun edited comment on SPARK-24530 at 6/26/18 9:49 PM:


[~mengxr] and [~hyukjin.kwon]. My environment is macOS, *python 3*, Sphinx 
v1.6.3.  
{code}
~/s/p/docs:master$ make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.6.3
making output directory...
...
{code}

According to the reports above, many combinations of Python 2.7 and Sphinx
look broken?


was (Author: dongjoon):
[~mengxr] and [~hyukjin.kwon]. My environment is macOS, *python 3*, Sphinx 
v1.6.3.  
{code}
~/s/p/docs:master$ make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.6.3
making output directory...
...
{code}

According to the above reports, many combinations of Python 2.7 and Sphinx 
looks broken.

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the
> generated html doc doesn't render class docs correctly. I attached the
> screenshot from the Spark 2.3 docs and the master docs generated on my local
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Resolved] (SPARK-24423) Add a new option `query` for JDBC sources

2018-06-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24423.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.4.0

> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
> {code} 
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", "dbName.tableName")
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Normally, users do not fetch the whole JDBC table due to the poor 
> performance/throughput of JDBC. Thus, they normally just fetch a small set of 
> tables. For advanced users, they can pass a subquery as the option.   
> {code} 
>  val query = """ (select * from tableName limit 10) as tmp """
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  However, this is not straightforward for end users. We should simply allow users
> to specify the query by a new option `query`. We will handle the complexity 
> for them. 
> {code} 
>  val query = """select * from tableName limit 10"""
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*{color:#ff}query{color}*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Users are not allowed to specify query and dbtable at the same time. 






[jira] [Created] (SPARK-24662) Structured Streaming should support LIMIT

2018-06-26 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-24662:


 Summary: Structured Streaming should support LIMIT
 Key: SPARK-24662
 URL: https://issues.apache.org/jira/browse/SPARK-24662
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Mukul Murthy


Make structured streams support the LIMIT operator. 

This will undo SPARK-24525 as the limit operator would be a superior solution.
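
A minimal sketch of what the requested behaviour would look like from the user
side, assuming a hypothetical socket source and a memory sink; on 2.3.x the
limit call is still rejected at analysis time, which is what this issue asks to
change:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-limit-sketch").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The requested behaviour: cap the result table at 10 rows via LIMIT
// instead of emulating it inside a custom sink.
val query = lines.limit(10)
  .writeStream
  .format("memory")
  .queryName("limited")
  .outputMode("append")
  .start()
{code}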






[jira] [Resolved] (SPARK-6237) Support uploading blocks > 2GB as a stream

2018-06-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-6237.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21346
[https://github.com/apache/spark/pull/21346]

> Support uploading blocks > 2GB as a stream
> --
>
> Key: SPARK-6237
> URL: https://issues.apache.org/jira/browse/SPARK-6237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Assigned] (SPARK-6237) Support uploading blocks > 2GB as a stream

2018-06-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-6237:
-

Assignee: Imran Rashid

> Support uploading blocks > 2GB as a stream
> --
>
> Key: SPARK-6237
> URL: https://issues.apache.org/jira/browse/SPARK-6237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Commented] (SPARK-24208) Cannot resolve column in self join after applying Pandas UDF

2018-06-26 Thread Stu (Michael Stewart) (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524341#comment-16524341
 ] 

Stu (Michael Stewart) commented on SPARK-24208:
---

[~hyukjin.kwon] I can confirm I ran into this issue too. The issue, as the OP 
noted, stems from having a pandas GROUPED_MAP UDF applied to a DF prior to 
attempting a self-join of said DF against itself. Beyond that I've not 
investigated.

> Cannot resolve column in self join after applying Pandas UDF
> 
>
> Key: SPARK-24208
> URL: https://issues.apache.org/jira/browse/SPARK-24208
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: AWS EMR 5.13.0
> Amazon Hadoop distribution 2.8.3
> Spark 2.3.0
> Pandas 0.22.0
>Reporter: Rafal Ganczarek
>Priority: Minor
>
> I noticed that after applying a Pandas UDF function, a self join of the
> resulting DataFrame will fail to resolve columns. The workaround that I found
> is to recreate the DataFrame from its RDD and schema.
> Below you can find a Python code that reproduces the issue.
> {code:java}
> from pyspark import Row
> import pyspark.sql.functions as F
> @F.pandas_udf('key long, col string', F.PandasUDFType.GROUPED_MAP)
> def dummy_pandas_udf(df):
> return df[['key','col']]
> df = spark.createDataFrame([Row(key=1,col='A'), Row(key=1,col='B'), 
> Row(key=2,col='C')])
> # transformation that causes the issue
> df = df.groupBy('key').apply(dummy_pandas_udf)
> # WORKAROUND that fixes the issue
> # df = spark.createDataFrame(df.rdd, df.schema)
> df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == 
> F.col('temp1.key')).show()
> {code}
> If workaround line is commented out, then above code fails with the following 
> error:
> {code:java}
> AnalysisExceptionTraceback (most recent call last)
>  in ()
>  12 # df = spark.createDataFrame(df.rdd, df.schema)
>  13 
> ---> 14 df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == 
> F.col('temp1.key')).show()
> /usr/lib/spark/python/pyspark/sql/dataframe.py in join(self, other, on, how)
> 929 on = self._jseq([])
> 930 assert isinstance(how, basestring), "how should be 
> basestring"
> --> 931 jdf = self._jdf.join(other._jdf, on, how)
> 932 return DataFrame(jdf, self.sql_ctx)
> 933 
> /usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, 
> *args)
>1158 answer = self.gateway_client.send_command(command)
>1159 return_value = get_return_value(
> -> 1160 answer, self.gateway_client, self.target_id, self.name)
>1161 
>1162 for temp_arg in temp_args:
> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u"cannot resolve '`temp0.key`' given input columns: 
> [temp0.key, temp0.col];;\n'Join Inner, ('temp0.key = 'temp1.key)\n:- 
> AnalysisBarrier\n: +- SubqueryAlias temp0\n:+- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n:   +- Project [key#4099L, col#4098, 
> key#4099L]\n:  +- LogicalRDD [col#4098, key#4099L], false\n+- 
> AnalysisBarrier\n  +- SubqueryAlias temp1\n +- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n+- Project [key#4099L, col#4098, 
> key#4099L]\n   +- LogicalRDD [col#4098, key#4099L], false\n"
> {code}
> The same happens, if instead of DataFrame API I use Spark SQL to do a self 
> join:
> {code:java}
> # df is a DataFrame after applying dummy_pandas_udf
> df.createOrReplaceTempView('df')
> spark.sql('''
> SELECT 
> *
> FROM df temp0
> LEFT JOIN df temp1 ON
> temp0.key == temp1.key
> ''').show()
> {code}






[jira] [Created] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"

2018-06-26 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24663:
--

 Summary: Flaky test: StreamingContextSuite "stop slow receiver 
gracefully"
 Key: SPARK-24663
 URL: https://issues.apache.org/jira/browse/SPARK-24663
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


This is another test that sometimes fails on our build machines, although I 
can't find failures on the riselab jenkins servers. Failure looks like:

{noformat}
org.scalatest.exceptions.TestFailedException: 0 was not greater than 0
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
  at 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356)
  at 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
  at 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
{noformat}

The test fails in about 2s, while a successful run generally takes 15s. Looking 
at the logs, the receiver hasn't even started when things fail, which points at 
a race during test initialization.






[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-26 Thread Perry Chu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perry Chu updated SPARK-24447:
--
Priority: Minor  (was: Major)

> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Minor
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context. 
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #
> print(sims.entries.context) # PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
> partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities






[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-26 Thread Perry Chu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perry Chu updated SPARK-24447:
--
Description: 
The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() 
appears to be losing track of the spark context if spark is stopped and 
restarted in pyspark.

I'm pretty new to spark - not sure if the problem is on the python side or the 
scala side - would appreciate someone more experienced taking a look.

This snippet should reproduce the error:
{code:java}
from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
print(sims.numRows(),sims.numCols())

## This throws an error (stack trace below)
print(sims.entries.first())

## Later I tried this
print(rows.context) #
print(sims.entries.context) #, 
then throws an error{code}
Error stack trace
{code:java}
---
AttributeError Traceback (most recent call last)
 in ()
> 1 sims.entries.first()

/usr/lib/spark/python/pyspark/rdd.py in first(self)
1374 ValueError: RDD is empty
1375 """
-> 1376 rs = self.take(1)
1377 if rs:
1378 return rs[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1356
1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
1359
1360 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
partitions, allowLocal)
999 # SparkContext#runJob.
1000 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
1003

AttributeError: 'NoneType' object has no attribute 'sc'
{code}
PySpark columnSimilarities documentation

[http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities]

  was:
The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() 
appears to be losing track of the spark context. 

I'm pretty new to spark - not sure if the problem is on the python side or the 
scala side - would appreciate someone more experienced taking a look.

This snippet should reproduce the error:
{code:java}
from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
print(sims.numRows(),sims.numCols())

## This throws an error (stack trace below)
print(sims.entries.first())

## Later I tried this
print(rows.context) #
print(sims.entries.context) #, 
then throws an error{code}
Error stack trace
{code:java}
---
AttributeError Traceback (most recent call last)
 in ()
> 1 sims.entries.first()

/usr/lib/spark/python/pyspark/rdd.py in first(self)
1374 ValueError: RDD is empty
1375 """
-> 1376 rs = self.take(1)
1377 if rs:
1378 return rs[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1356
1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
1359
1360 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
partitions, allowLocal)
999 # SparkContext#runJob.
1000 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
1003

AttributeError: 'NoneType' object has no attribute 'sc'
{code}
PySpark columnSimilarities documentation

http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities


> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Minor
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context if spark is stopped and restarted in pyspark.
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.spark

[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-26 Thread Perry Chu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perry Chu updated SPARK-24447:
--
Description: 
The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() 
appears to be losing track of the spark context if spark is stopped and 
restarted in pyspark.

I'm pretty new to spark - not sure if the problem is on the python side or the 
scala side - would appreciate someone more experienced taking a look.

This snippet should reproduce the error:
{code:java}
import pyspark
from pyspark.mllib.linalg.distributed import RowMatrix

spark.stop()
spark = pyspark.sql.SparkSession.builder.getOrCreate()

rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
print(sims.numRows(),sims.numCols())

## This throws an error (stack trace below)
print(sims.entries.first())

## Later I tried this
print(rows.context) #
print(sims.entries.context) #, 
then throws an error{code}
Error stack trace
{code:java}
---
AttributeError Traceback (most recent call last)
 in ()
> 1 sims.entries.first()

/usr/lib/spark/python/pyspark/rdd.py in first(self)
1374 ValueError: RDD is empty
1375 """
-> 1376 rs = self.take(1)
1377 if rs:
1378 return rs[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1356
1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
1359
1360 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
partitions, allowLocal)
999 # SparkContext#runJob.
1000 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
1003

AttributeError: 'NoneType' object has no attribute 'sc'
{code}
PySpark columnSimilarities documentation

[http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities]
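For reference, a minimal diagnostic sketch (assuming a pyspark shell where {{spark}} is the active session, same toy matrix as above) that checks which SparkContext the Python RDD wrappers point at after the stop/restart cycle:
{code:python}
# Diagnostic sketch only: compare the contexts held by the RDD wrappers with the
# active SparkContext after a stop/restart cycle.
import pyspark
from pyspark.mllib.linalg.distributed import RowMatrix

spark.stop()
spark = pyspark.sql.SparkSession.builder.getOrCreate()

rows = spark.sparkContext.parallelize([[0, 1, 2], [1, 1, 1]])
sims = RowMatrix(rows).columnSimilarities()

# `rows` was created from the new context, so this should print True.
print(rows.context is spark.sparkContext)

# If the Python wrapper is at fault, the entries RDD may hold a stale (or None)
# context instead of the active one.
print(sims.entries.context is spark.sparkContext)
{code}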

  was:
The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() 
appears to be losing track of the spark context if spark is stopped and 
restarted in pyspark.

I'm pretty new to spark - not sure if the problem is on the python side or the 
scala side - would appreciate someone more experienced taking a look.

This snippet should reproduce the error:
{code:java}
from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
print(sims.numRows(),sims.numCols())

## This throws an error (stack trace below)
print(sims.entries.first())

## Later I tried this
print(rows.context) #
print(sims.entries.context) #, 
then throws an error{code}
Error stack trace
{code:java}
---
AttributeError Traceback (most recent call last)
 in ()
> 1 sims.entries.first()

/usr/lib/spark/python/pyspark/rdd.py in first(self)
1374 ValueError: RDD is empty
1375 """
-> 1376 rs = self.take(1)
1377 if rs:
1378 return rs[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1356
1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
1359
1360 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
partitions, allowLocal)
999 # SparkContext#runJob.
1000 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
1003

AttributeError: 'NoneType' object has no attribute 'sc'
{code}
PySpark columnSimilarities documentation

[http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities]


> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Minor
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context if spark is stopped and restarted in pyspark.
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> T

[jira] [Resolved] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24659.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21643
[https://github.com/apache/spark/pull/21643]

> GenericArrayData.equals should respect element type differences
> ---
>
> Key: SPARK-24659
> URL: https://issues.apache.org/jira/browse/SPARK-24659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.0
>
>
> Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
> element type differences, due to a caveat in Scala's {{==}} operator.
> e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
> GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
> against the semantics of Spark SQL's array type, where {{array<int>}} and 
> {{array<bigint>}} are considered to be incompatible types and thus should never 
> be equal.
> This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
> so that it's more aligned to Spark SQL's array type semantics.






[jira] [Assigned] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24659:
---

Assignee: Kris Mok

> GenericArrayData.equals should respect element type differences
> ---
>
> Key: SPARK-24659
> URL: https://issues.apache.org/jira/browse/SPARK-24659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.0
>
>
> Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
> element type differences, due to a caveat in Scala's {{==}} operator.
> e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
> GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
> against the semantics of Spark SQL's array type, where {{array<int>}} and 
> {{array<bigint>}} are considered to be incompatible types and thus should never 
> be equal.
> This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
> so that it's more aligned to Spark SQL's array type semantics.






[jira] [Commented] (SPARK-23014) Migrate MemorySink fully to v2

2018-06-26 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524481#comment-16524481
 ] 

Richard Yu commented on SPARK-23014:


Hi [~joseph.torres], are you still working on this PR? It seems there has been 
no progress on it for a while now.

> Migrate MemorySink fully to v2
> --
>
> Key: SPARK-23014
> URL: https://issues.apache.org/jira/browse/SPARK-23014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> There's already a MemorySinkV2, but its use is controlled by a flag. We need 
> to remove the V1 sink and always use the V2 one.






[jira] [Created] (SPARK-24664) Column support name getter

2018-06-26 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-24664:


 Summary: Column support name getter
 Key: SPARK-24664
 URL: https://issues.apache.org/jira/browse/SPARK-24664
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: zhengruifeng


In SPARK-24557 (https://github.com/apache/spark/pull/21563), we found that it 
would be convenient if Column supported a name getter.
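For context, a rough sketch of the helper users write today to recover a column's output name (illustrative only, not the proposed API; assumes an active {{spark}} session):
{code:python}
from pyspark.sql import functions as F

def column_name(df, col):
    """Return the output name the given Column gets when selected from df.

    Plans a tiny projection and reads the name back from the schema; no job runs.
    """
    return df.select(col).schema[0].name

df = spark.range(3)
print(column_name(df, F.col("id") + 1))               # e.g. '(id + 1)'
print(column_name(df, (F.col("id") + 1).alias("x")))  # 'x'
{code}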






[jira] [Assigned] (SPARK-24605) size(null) should return null

2018-06-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24605:
---

Assignee: Maxim Gekk

> size(null) should return null
> -
>
> Key: SPARK-24605
> URL: https://issues.apache.org/jira/browse/SPARK-24605
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The default behavior size(null) == -1 is a big problem for several reasons:
> # It is inconsistent with how SQL functions handle nulls.
> # It is an extreme violation of [the Principle of Least 
> Astonishment|https://en.wikipedia.org/wiki/Principle_of_least_astonishment] 
> (POLA)
> # It is not called out anywhere in the Spark docs or even [the Hive 
> docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> # It can lead to subtle bugs in analytics.
> For example, our client discovered this behavior while investigating 
> post-click user engagement in their AdTech system. The schema was per ad 
> placement and post-click user engagements were in an array of structs. The 
> culprit was 
> df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), 
> ...), which subtracted 1 for every click without post-click engagement. 
> Luckily, the behavior led to negative engagement counts in some periods, 
> which alerted them to the problem and this bizarre behavior.
> The current behavior was inherited from Hive. The most consistent behavior, 
> ignoring the insanity that Hive created in the first place, is for size(null) 
> to behave like length(null), which returns null. This also handles the 
> aggregation case with sum/avg, etc.
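> For illustration, a small PySpark sketch of the pitfall under the current default (size(NULL) == -1) and a defensive pattern; the column names are made up:
> {code:python}
> from pyspark.sql import SparkSession, functions as F
>
> spark = SparkSession.builder.master("local[1]").appName("size-null-demo").getOrCreate()
>
> df = spark.createDataFrame(
>     [("p1", [1, 2]), ("p1", None), ("p2", None)],
>     "placementId string, engagements array<int>",
> )
>
> # With size(NULL) == -1, every row without engagements subtracts 1 from the sum.
> df.groupBy("placementId").agg(F.sum(F.size("engagements")).alias("engagement_count")).show()
>
> # A defensive pattern that does not depend on the size(NULL) behavior.
> safe_size = F.when(F.col("engagements").isNull(), F.lit(0)).otherwise(F.size("engagements"))
> df.groupBy("placementId").agg(F.sum(safe_size).alias("engagement_count")).show()
> {code}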






[jira] [Resolved] (SPARK-24605) size(null) should return null

2018-06-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24605.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21598
[https://github.com/apache/spark/pull/21598]

> size(null) should return null
> -
>
> Key: SPARK-24605
> URL: https://issues.apache.org/jira/browse/SPARK-24605
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> The default behavior size(null) == -1 is a big problem for several reasons:
> # It is inconsistent with how SQL functions handle nulls.
> # It is an extreme violation of [the Principle of Least 
> Astonishment|https://en.wikipedia.org/wiki/Principle_of_least_astonishment] 
> (POLA)
> # It is not called out anywhere in the Spark docs or even [the Hive 
> docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> # It can lead to subtle bugs in analytics.
> For example, our client discovered this behavior while investigating 
> post-click user engagement in their AdTech system. The schema was per ad 
> placement and post-click user engagements were in an array of structs. The 
> culprit was 
> df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), 
> ...), which subtracted 1 for every click without post-click engagement. 
> Luckily, the behavior led to negative engagement counts in some periods, 
> which alerted them to the problem and this bizarre behavior.
> The current behavior was inherited from Hive. The most consistent behavior, 
> ignoring the insanity that Hive created in the first place, is for size(null) 
> to behave like length(null), which returns null. This also handles the 
> aggregation case with sum/avg, etc.






[jira] [Assigned] (SPARK-23927) High-order function: sequence

2018-06-26 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23927:
-

Assignee: Alex Vayda

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Alex Vayda
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array<bigint>
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array<bigint>
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array<date>
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array<date>
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array<timestamp>
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.
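> For reference, a quick sketch of the corresponding calls from PySpark once this lands (hypothetical until the change is released, targeted at 2.4.0; assumes an active {{spark}} session):
> {code:python}
> # Integer sequence, default step of 1 (or -1 when start > stop).
> spark.sql("SELECT sequence(1, 5)").show(truncate=False)
> spark.sql("SELECT sequence(5, 1)").show(truncate=False)
>
> # Date sequence with an explicit interval step.
> spark.sql(
>     "SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month)"
> ).show(truncate=False)
> {code}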






[jira] [Resolved] (SPARK-23927) High-order function: sequence

2018-06-26 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23927.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21155
[https://github.com/apache/spark/pull/21155]

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Alex Vayda
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array<bigint>
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array<bigint>
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array<date>
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array<date>
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array<timestamp>
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.






[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-26 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524510#comment-16524510
 ] 

Xiangrui Meng commented on SPARK-24530:
---

Confirmed that macOS, Python 3, and Sphinx v1.6.6 produce correct docs on my 
machine. I didn't find any related reports on Sphinx's GitHub, so if we can put 
together a minimal reproducible example, we should report the issue to Sphinx. On 
our side, we should update the release procedure doc to use Python 3 to generate 
docs. We should also update the official docs that are broken (2.1.2, 2.2.1, 
2.3.1).
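For the minimal reproducible example, something along these lines might be enough (module and class names made up): a class whose docstring starts with a signature line, built with sphinx-autodoc under Python 2 vs Python 3 with the default {{autodoc_docstring_signature}} setting, to see whether the rendering difference shows up outside Spark:
{code:python}
# repro_module.py - a tiny stand-in for the pyspark.ml docstring pattern.
# Run sphinx-build over it with Python 2 and Python 3 and compare the rendered
# class signature.


class ToyEstimator(object):
    """
    ToyEstimator(threshold=0.5, maxIter=10)

    A toy class whose docstring starts with a signature line, mimicking the
    pyspark.ml classes that rely on autodoc_docstring_signature.
    """

    def __init__(self, threshold=0.5, maxIter=10):
        self.threshold = threshold
        self.maxIter = maxIter
{code}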

[~hyukjin.kwon] Do you have time to take this ticket? (feel free to say no if 
you are busy:)

cc: [~smilegator]

 

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (using Python 2?)

2018-06-26 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24530:
--
Summary: Sphinx doesn't render autodoc_docstring_signature correctly (using 
Python 2?)  (was: pyspark.ml doesn't generate class docs correctly)

> Sphinx doesn't render autodoc_docstring_signature correctly (using Python 2?)
> -
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-26 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24530:
--
Summary: Sphinx doesn't render autodoc_docstring_signature correctly (with 
Python 2?) and pyspark.ml docs are broken  (was: Sphinx doesn't render 
autodoc_docstring_signature correctly (using Python 2?))

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Commented] (SPARK-23927) High-order function: sequence

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524519#comment-16524519
 ] 

Apache Spark commented on SPARK-23927:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21646

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Alex Vayda
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array<bigint>
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array<bigint>
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array<date>
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array<date>
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array<timestamp>
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.






[jira] [Commented] (SPARK-21335) support un-aliased subquery

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524527#comment-16524527
 ] 

Apache Spark commented on SPARK-21335:
--

User 'cnZach' has created a pull request for this issue:
https://github.com/apache/spark/pull/21647

> support un-aliased subquery
> ---
>
> Key: SPARK-21335
> URL: https://issues.apache.org/jira/browse/SPARK-21335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: release-notes
> Fix For: 2.3.0
>
>
> Un-aliased subqueries have been supported by Spark SQL for a long time. Their 
> semantics were not well defined and had confusing behaviors, and it's not 
> standard SQL syntax, so we disallowed it in 
> https://issues.apache.org/jira/browse/SPARK-20690 .
> However, this is a breaking change, and we do have existing queries using 
> un-aliased subqueries. We should add the support back and fix its semantics.
> After the fix, there is no syntax change from branch 2.2 to master, but we 
> invalidate one weird use case:
> {{SELECT v.i from (SELECT i FROM v)}}. Now this query will throw an analysis 
> exception because users should not be able to use a qualifier that is only 
> defined inside the subquery.
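> For illustration, the behavior described above expressed through PySpark (a sketch, assuming an active {{spark}} session; view and column names follow the example):
> {code:python}
> spark.range(3).selectExpr("id AS i").createOrReplaceTempView("v")
>
> # An un-aliased subquery is accepted again:
> spark.sql("SELECT i FROM (SELECT i FROM v)").show()
>
> # But referencing the inner view's qualifier from outside the subquery should
> # fail analysis:
> spark.sql("SELECT v.i FROM (SELECT i FROM v)").show()
> {code}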






[jira] [Comment Edited] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524544#comment-16524544
 ] 

Hyukjin Kwon edited comment on SPARK-24530 at 6/27/18 4:01 AM:
---

Will take a look on this weekends. Please go ahead if anyone finds some time 
till then :-).


was (Author: hyukjin.kwon):
Will take a look on the weekends. Please go ahead if anyone finds some time 
till then :-).

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524544#comment-16524544
 ] 

Hyukjin Kwon commented on SPARK-24530:
--

Will take a look on the weekends. Please go ahead if anyone finds some time 
till then :-).

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-26 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524554#comment-16524554
 ] 

Xiao Li commented on SPARK-24530:
-

[~hyukjin.kwon]  Thanks for helping with this!

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column

2018-06-26 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524590#comment-16524590
 ] 

Reynold Xin commented on SPARK-24642:
-

Do we want this as an aggregate function? I'm thinking it's better to just take 
a string and infer the schema from that string.

How would the query you provided compile if it were an aggregate function?

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> We need to add a new aggregate function - *infer_schema()*. The function should 
> infer a schema for a set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing the output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer the schema programmatically in Scala/Python and 
> pass it as the second argument, but in SQL it is not possible. A user has to 
> pass the schema as a string literal in SQL. The new function should allow using 
> it in SQL as in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}
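> For context, a rough sketch of the current programmatic workaround mentioned above (assumes an active {{spark}} session and a hypothetical table json_table with a string column json_col):
> {code:python}
> from pyspark.sql import functions as F
>
> df = spark.table("json_table")
>
> # Infer the schema by running the JSON reader over the column's strings...
> json_strings = df.select("json_col").rdd.map(lambda row: row.json_col)
> inferred_schema = spark.read.json(json_strings).schema
>
> # ...then pass the inferred schema to from_json explicitly.
> df.select(F.from_json("json_col", inferred_schema).alias("parsed")).show(truncate=False)
> {code}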






[jira] [Created] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs

2018-06-26 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-24665:
---

 Summary: Add SQLConf in PySpark to manage all sql configs
 Key: SPARK-24665
 URL: https://issues.apache.org/jira/browse/SPARK-24665
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Li Yuanjian


Whenever a new config is added in PySpark, we currently read it by hard-coding 
the config name and default value. We should move all the configs into a class, 
like what we did with SQLConf in Spark SQL.
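For illustration, a rough sketch of what such a holder class could look like on the Python side (names and the chosen config are illustrative, not the actual implementation):
{code:python}
class SQLConf(object):
    """Single place for SQL config keys/defaults used by PySpark, instead of
    hard-coding them at every call site."""

    ARROW_ENABLED = ("spark.sql.execution.arrow.enabled", "false")

    def __init__(self, spark):
        self._conf = spark.conf

    def arrow_enabled(self):
        key, default = self.ARROW_ENABLED
        return self._conf.get(key, default).lower() == "true"


# Usage, assuming an active SparkSession named `spark`:
# conf = SQLConf(spark)
# if conf.arrow_enabled():
#     ...
{code}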






[jira] [Assigned] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24665:


Assignee: Apache Spark

> Add SQLConf in PySpark to manage all sql configs
> 
>
> Key: SPARK-24665
> URL: https://issues.apache.org/jira/browse/SPARK-24665
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>Priority: Major
>
> Whenever a new config is added in PySpark, we currently read it by hard-coding 
> the config name and default value. We should move all the configs into a class, 
> like what we did with SQLConf in Spark SQL.






[jira] [Commented] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs

2018-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524603#comment-16524603
 ] 

Apache Spark commented on SPARK-24665:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/21648

> Add SQLConf in PySpark to manage all sql configs
> 
>
> Key: SPARK-24665
> URL: https://issues.apache.org/jira/browse/SPARK-24665
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
>
> Whenever a new config is added in PySpark, we currently read it by hard-coding 
> the config name and default value. We should move all the configs into a class, 
> like what we did with SQLConf in Spark SQL.






[jira] [Assigned] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs

2018-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24665:


Assignee: (was: Apache Spark)

> Add SQLConf in PySpark to manage all sql configs
> 
>
> Key: SPARK-24665
> URL: https://issues.apache.org/jira/browse/SPARK-24665
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
>
> Whenever a new config is added in PySpark, we currently read it by hard-coding 
> the config name and default value. We should move all the configs into a class, 
> like what we did with SQLConf in Spark SQL.






[jira] [Commented] (SPARK-23102) Migrate kafka sink

2018-06-26 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524615#comment-16524615
 ] 

Richard Yu commented on SPARK-23102:


Just a question: I have noted that ```KafkaStreamWriter.scala``` is already an 
implementation of a DataSourceV2 sink. So are we going to continue to evolve 
that class, or are we sticking with ```KafkaWriter```?

> Migrate kafka sink
> --
>
> Key: SPARK-23102
> URL: https://issues.apache.org/jira/browse/SPARK-23102
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23102) Migrate kafka sink

2018-06-26 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524615#comment-16524615
 ] 

Richard Yu edited comment on SPARK-23102 at 6/27/18 6:03 AM:
-

[~joseph.torres] Just a question: I have noted that 
```KafkaStreamWriter.scala``` is already an implementation of a DataSourceV2 
sink. So are we going to continue to evolve that class, or are we sticking with 
```KafkaWriter```?


was (Author: yohan123):
Just a question: I have noted that ```KafkaStreamWriter.scala``` is already an 
implementation of a DataSourceV2 sink. So are we going to continue to evolve 
that class, or are we sticking with ```KafkaWriter```?

> Migrate kafka sink
> --
>
> Key: SPARK-23102
> URL: https://issues.apache.org/jira/browse/SPARK-23102
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Comment Edited] (SPARK-23102) Migrate kafka sink

2018-06-26 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524615#comment-16524615
 ] 

Richard Yu edited comment on SPARK-23102 at 6/27/18 6:04 AM:
-

[~joseph.torres] Just a question: I have noted that {{KafkaStreamWriter}} is 
already an implementation of a DataSourceV2 sink. So are we going to continue 
to evolve that class, or are we sticking with  {{KafkaWriter}}? 

 


was (Author: yohan123):
[~joseph.torres] Just a question: I have noted that 
```KafkaStreamWriter.scala``` is already an implementation of a DataSourceV2 
sink. So are we going to continue to evolve that class, or are we sticking with 
```KafkaWriter```?

> Migrate kafka sink
> --
>
> Key: SPARK-23102
> URL: https://issues.apache.org/jira/browse/SPARK-23102
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Issue Comment Deleted] (SPARK-23102) Migrate kafka sink

2018-06-26 Thread Richard Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-23102:
---
Comment: was deleted

(was: [~joseph.torres] Just a question: I have noted that {{KafkaStreamWriter}} 
is already an implementation of a DataSourceV2 sink. So are we going to 
continue to evolve that class, or are we sticking with  {{KafkaWriter}}? 

 )

> Migrate kafka sink
> --
>
> Key: SPARK-23102
> URL: https://issues.apache.org/jira/browse/SPARK-23102
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Commented] (SPARK-23102) Migrate kafka sink

2018-06-26 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524628#comment-16524628
 ] 

Richard Yu commented on SPARK-23102:


Hi [~joseph.torres] Mind if I take this JIRA?

> Migrate kafka sink
> --
>
> Key: SPARK-23102
> URL: https://issues.apache.org/jira/browse/SPARK-23102
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>



