[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table
[ https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523344#comment-16523344 ] Ohad Raviv commented on SPARK-24528: Hi, well it took me some time to get to it, but here are my design conclusions: # Currently all the file scans are done with FileScanRDD; in its current implementation it gets a list of files in each partition and iterates them one after the other. # That means we probably need another FileScanRDD that can "open" all the files and iterate them in a merge-sort manner (like maintaining a heap to know what's the next file to iterate from); see the sketch after this message. # The FileScanRDD is created in FileSourceScanExec.createBucketedReadRDD if the data is bucketed. # FileSourceScanExec is created in FileSourceStrategy. # That means we could understand in FileSourceStrategy whether the data read output is required to be sorted, and percolate this knowledge to the creation of the new FileScan(Sorted?)RDD. # One thing to note here is to enable this sorted reading only if it's required; otherwise it will cause a performance issue. Please tell me WDYT. > Missing optimization for Aggregations/Windowing on a bucketed table > --- > > Key: SPARK-24528 > URL: https://issues.apache.org/jira/browse/SPARK-24528 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > Closely related to SPARK-24410, we're trying to optimize a very common use > case we have of getting the most updated row by id from a fact table. > We're saving the table bucketed to skip the shuffle stage, but we still > "waste" time on the Sort operator even though the data is already sorted. > Here's a good example: > {code:java} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("key", "t1") > .saveAsTable("a1"){code} > {code:java} > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, > key#24L, t1, t1#25L, t2, t2#26L))]) > +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, > t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))]) > +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, > Format: Parquet, Location: ...{code} > > and here's a bad example, but more realistic: > {code:java} > sparkSession.sql("set spark.sql.shuffle.partitions=2") > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, > key#32L, t1, t1#33L, t2, t2#34L))]) > +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, > t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))]) > +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0 > +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, > Format: Parquet, Location: ... > {code} > > I've traced the problem to DataSourceScanExec#235: > {code:java} > val sortOrder = if (sortColumns.nonEmpty) { > // In case of bucketing, its possible to have multiple files belonging to > the > // same bucket in a given relation. Each of these files are locally sorted > // but those files combined together are not globally sorted. 
Given that, > // the RDD partition will not be sorted even if the relation has sort > columns set > // Current solution is to check if all the buckets have a single file in it > val files = selectedPartitions.flatMap(partition => partition.files) > val bucketToFilesGrouping = > files.map(_.getPath.getName).groupBy(file => > BucketingUtils.getBucketId(file)) > val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= > 1){code} > So obviously the code avoids dealing with this situation for now. > Could you think of a way to solve this or bypass it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
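For point 2 above, here is a minimal sketch of the heap-based merge the comment describes, assuming each per-file iterator is already sorted by the bucket's sort key and that an Ordering for the row key is available. The names are illustrative only, not Spark's actual API:

{code:scala}
import scala.collection.BufferedIterator
import scala.collection.mutable

// Illustrative k-way merge of per-file row iterators that are each already
// sorted, using a min-heap keyed on the head element of every iterator.
def mergeSorted[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
  new Iterator[T] {
    // Buffer the non-empty iterators so their next element can be peeked at.
    private val heap = mutable.PriorityQueue(
      iterators.map(_.buffered).filter(_.hasNext): _*)(
      Ordering.by[BufferedIterator[T], T](_.head).reverse) // reverse => min-heap

    override def hasNext: Boolean = heap.nonEmpty

    override def next(): T = {
      val smallest = heap.dequeue()                 // iterator with the smallest head
      val row = smallest.next()
      if (smallest.hasNext) heap.enqueue(smallest)  // put it back if rows remain
      row
    }
  }
{code}

A FileScan(Sorted?)RDD built around something like this would only pay the heap overhead when sorted output is actually requested, in line with point 6.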
[jira] [Updated] (SPARK-24650) GroupingSet
[ https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24650: - Priority: Major (was: Blocker) > GroupingSet > --- > > Key: SPARK-24650 > URL: https://issues.apache.org/jira/browse/SPARK-24650 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: CDH 5.X, Spark 2.3 >Reporter: Mihir Sahu >Priority: Major > Labels: Grouping, Sets > > If a grouping set is used in Spark SQL, the plan does not perform > optimally. > If the input to a grouping set is x rows and there are y grouping sets, then > the number of rows that are processed is currently x*y. > Example: let a DataFrame have columns col1, col2, col3 and col4, let the number > of rows be rowNo, > and let the grouping sets consist of: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2. > The number of rows processed in such a case is 3*(rowNo * size of each row). > However, is this the optimal way of processing the data? > If the y groups are derivable from each other, can we reduce the > volume processed by removing columns as we progress to the lower dimensions of > processing? > Currently, while processing percentiles, a lot of data seems to be > processed, causing performance issues. > Need to look at whether this can be optimised -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
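To make the reported scenario concrete, here is a sketch of the kind of query described, assuming a SparkSession named spark and a table t with the four columns from the example (both names are illustrative). Spark plans GROUPING SETS through an Expand node that emits one copy of every input row per grouping set, which is where the x*y row count comes from:

{code:scala}
// Three grouping sets over the same input: the physical plan's Expand node
// replicates each input row once per set, so 3 * rowNo rows reach the
// aggregation even though only rowNo rows were read.
spark.sql("""
  SELECT col1, col2, col3, col4, count(*) AS cnt
  FROM t
  GROUP BY col1, col2, col3, col4
  GROUPING SETS ((col1, col2, col3), (col2, col4), (col1, col2))
""").explain()
{code}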
[jira] [Commented] (SPARK-24650) GroupingSet
[ https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523351#comment-16523351 ] Hyukjin Kwon commented on SPARK-24650: -- Please avoid setting Blocker, which is usually reserved for committers. > GroupingSet > --- > > Key: SPARK-24650 > URL: https://issues.apache.org/jira/browse/SPARK-24650 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: CDH 5.X, Spark 2.3 >Reporter: Mihir Sahu >Priority: Major > Labels: Grouping, Sets > > If a grouping set is used in Spark SQL, the plan does not perform > optimally. > If the input to a grouping set is x rows and there are y grouping sets, then > the number of rows that are processed is currently x*y. > Example: let a DataFrame have columns col1, col2, col3 and col4, let the number > of rows be rowNo, > and let the grouping sets consist of: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2. > The number of rows processed in such a case is 3*(rowNo * size of each row). > However, is this the optimal way of processing the data? > If the y groups are derivable from each other, can we reduce the > volume processed by removing columns as we progress to the lower dimensions of > processing? > Currently, while processing percentiles, a lot of data seems to be > processed, causing performance issues. > Need to look at whether this can be optimised -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24651) Add ability to write null values while writing JSON
[ https://issues.apache.org/jira/browse/SPARK-24651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523383#comment-16523383 ] Hyukjin Kwon commented on SPARK-24651: -- I think it's basically a duplicate of SPARK-23773. > Add ability to write null values while writing JSON > --- > > Key: SPARK-24651 > URL: https://issues.apache.org/jira/browse/SPARK-24651 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Matthew Liem >Priority: Minor > > Hello, > Spark is configured to ignore null values when writing JSON, based on > JacksonMessageWriter.scala during serialization: > |mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)| > In some scenarios it is useful to keep these fields. Looking to see if > this functionality can be added or made configurable, e.g. to use > Include.ALWAYS or other properties depending on the requirement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
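For reference, a minimal sketch of the Jackson setting quoted in the report, using the plain Jackson databind API; switching the inclusion policy from NON_NULL to ALWAYS is what would keep null fields in the output:

{code:scala}
import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.databind.ObjectMapper

// NON_NULL (what the quoted line sets) drops null fields during serialization;
// ALWAYS keeps them. The ticket asks for this behaviour to be configurable.
val mapper = new ObjectMapper()
mapper.setSerializationInclusion(JsonInclude.Include.ALWAYS)
{code}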
[jira] [Resolved] (SPARK-24649) SparkUDF.unapply is not backwards compatible
[ https://issues.apache.org/jira/browse/SPARK-24649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24649. -- Resolution: Invalid Catalyst is considered an internal API, and subject to change between minor releases. > SparkUDF.unapply is not backwards compatible > > > Key: SPARK-24649 > URL: https://issues.apache.org/jira/browse/SPARK-24649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Simeon H.K. Fitch >Priority: Minor > > The shape of the `ScalaUDF` case class changed in 2.3.0. A secondary > constructor that's backwards compatible with 2.1.x and 2.2.x was provided, > but a corresponding `unapply` method wasn't included. Therefore code such as > the following that worked in 2.1.x and 2.2.x no longer compiles: > {code:java} > val ScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
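Since Catalyst carries no compatibility guarantee here, one possible local workaround is a user-defined extractor that exposes only the 2.1/2.2-era fields of the 2.3.x ScalaUDF. This is purely a sketch in user code, not part of Spark, and assumes the 2.3.x field names:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUDF}
import org.apache.spark.sql.types.DataType

// Hypothetical extractor: matches a 2.3.x ScalaUDF but surfaces only the
// five fields the old destructuring pattern relied on.
object LegacyScalaUDF {
  def unapply(udf: ScalaUDF): Option[(AnyRef, DataType, Seq[Expression], Seq[DataType], Option[String])] =
    Some((udf.function, udf.dataType, udf.children, udf.inputTypes, udf.udfName))
}

// Usage mirroring the snippet above:
// val LegacyScalaUDF(function, dataType, children, inputTypes, udfName) = myUDF
{code}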
[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523399#comment-16523399 ] Hyukjin Kwon commented on SPARK-24644: -- Can you clarify the environment, in particular, PyArrow and Pandas versions? > Pyarrow exception while running pandas_udf on pyspark 2.3.1 > --- > > Key: SPARK-24644 > URL: https://issues.apache.org/jira/browse/SPARK-24644 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.1 > Environment: os: centos > pyspark 2.3.1 > spark 2.3.1 > pyarrow >= 0.8.0 >Reporter: Hichame El Khalfi >Priority: Major > > Hello, > When I try to run a `pandas_udf` on my spark dataframe, I get this error > > {code:java} > File > "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py", > lin > e 280, in load_stream > pdf = batch.to_pandas() > File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226) > return Table.from_batches([self]).to_pandas(nthreads=nthreads) > File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331) > mgr = pdcompat.table_to_blockmanager(options, self, memory_pool, > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 528, in table_to_blockmanager > blocks = _table_to_blocks(options, block_table, nthreads, memory_pool) > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 622, in _table_to_blocks > return [_reconstruct_block(item) for item in result] > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 446, in _reconstruct_block > block = _int.make_block(block_arr, placement=placement) > TypeError: make_block() takes at least 3 arguments (2 given) > {code} > > More than happy to provide any additional information -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24643) from_json should accept an aggregate function as schema
[ https://issues.apache.org/jira/browse/SPARK-24643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523403#comment-16523403 ] Hyukjin Kwon commented on SPARK-24643: -- SPARK-24642 is not added yet though ... > from_json should accept an aggregate function as schema > --- > > Key: SPARK-24643 > URL: https://issues.apache.org/jira/browse/SPARK-24643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, the *from_json()* function accepts only string literals as schema: > - Checking of schema argument inside of JsonToStructs: > [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L530] > - Accepting only string literal: > [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L749-L752] > JsonToStructs should be modified to accept results of aggregate functions > like *infer_schema* (see SPARK-24642). It should be possible to write SQL > like: > {code:sql} > select from_json(json_col, infer_schema(json_col)) from json_table > {code} > Here is a test case with existing aggregate function - *first()*: > {code:sql} > create temporary view schemas(schema) as select * from values > ('struct'), > ('map'); > select from_json('{"a":1}', first(schema)) from schemas; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24647) Sink Should Return OffsetSeqs For ProgressReporting
[ https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24647: - Fix Version/s: (was: 2.4.0) > Sink Should Return OffsetSeqs For ProgressReporting > --- > > Key: SPARK-24647 > URL: https://issues.apache.org/jira/browse/SPARK-24647 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Priority: Major > > To be able to track data lineage for Structured Streaming (I intend to > implement this to Open Source Project Spline), the monitoring needs to be > able to not only to track where the data was read from but also where results > were written to. This could be to my knowledge best implemented using > monitoring {{StreamingQueryProgress}}. However currently batch data offsets > are not available on {{Sink}} interface. Implementing as proposed would also > bring symmetry to {{StreamingQueryProgress}} fields sources and sink. > > *Similar Proposals* > Made in following jiras. These would not be sufficient for lineage tracking. > * https://issues.apache.org/jira/browse/SPARK-18258 > * https://issues.apache.org/jira/browse/SPARK-21313 > > *Current State* > * Method {{Sink#addBatch}} returns {{Unit}}. > * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using > {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method. > {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" > } > {code} > > > *Proposed State* > * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying > offsets of the written batch, e.g. Kafka does it by returning > {{RecordMetadata}} object from {{send}} method. > * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion > as {{sourceProgress}}. > > > {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", > "startOffset" : null, > "endOffset" { "sinkTopic": { "0": 333 }} > } > {code} > > *Implementation* > * PR submitters: Likely will be me and [~wajda] as soon as the discussion > ends positively. > * {{Sinks}}: Modify all sinks to conform a new interface or return dummy > values. > * {{ProgressReporter}}: Merge offsets from different batches properly, > similarly to how it is done for sources. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
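A rough sketch of the shape of the proposed change, purely for illustration (this is not Spark's actual Sink API, whose addBatch returns Unit): the sink reports the offsets of the batch it just wrote, modeled here as topic -> (partition -> offset), so ProgressReporter could fill in sinkProgress the way it does sourceProgress:

{code:scala}
import org.apache.spark.sql.DataFrame

// Hypothetical interface only; the real proposal discusses returning
// OffsetSeq or StreamProgress, which this simple map stands in for.
trait OffsetReportingSink {
  def addBatch(batchId: Long, data: DataFrame): Map[String, Map[String, Long]]
}
{code}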
[jira] [Commented] (SPARK-24647) Sink Should Return OffsetSeqs For ProgressReporting
[ https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523405#comment-16523405 ] Hyukjin Kwon commented on SPARK-24647: -- (please avoid to set a fix version which is usually set when it's actually fixed) > Sink Should Return OffsetSeqs For ProgressReporting > --- > > Key: SPARK-24647 > URL: https://issues.apache.org/jira/browse/SPARK-24647 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Priority: Major > > To be able to track data lineage for Structured Streaming (I intend to > implement this to Open Source Project Spline), the monitoring needs to be > able to not only to track where the data was read from but also where results > were written to. This could be to my knowledge best implemented using > monitoring {{StreamingQueryProgress}}. However currently batch data offsets > are not available on {{Sink}} interface. Implementing as proposed would also > bring symmetry to {{StreamingQueryProgress}} fields sources and sink. > > *Similar Proposals* > Made in following jiras. These would not be sufficient for lineage tracking. > * https://issues.apache.org/jira/browse/SPARK-18258 > * https://issues.apache.org/jira/browse/SPARK-21313 > > *Current State* > * Method {{Sink#addBatch}} returns {{Unit}}. > * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using > {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method. > {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" > } > {code} > > > *Proposed State* > * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying > offsets of the written batch, e.g. Kafka does it by returning > {{RecordMetadata}} object from {{send}} method. > * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion > as {{sourceProgress}}. > > > {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", > "startOffset" : null, > "endOffset" { "sinkTopic": { "0": 333 }} > } > {code} > > *Implementation* > * PR submitters: Likely will be me and [~wajda] as soon as the discussion > ends positively. > * {{Sinks}}: Modify all sinks to conform a new interface or return dummy > values. > * {{ProgressReporter}}: Merge offsets from different batches properly, > similarly to how it is done for sources. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)
[ https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523417#comment-16523417 ] Hyukjin Kwon commented on SPARK-24570: -- So you are saying {code} == SQL == SHOW TABLE EXTENDED FROM sit1_pb LIKE `*` --^^^ {code} doesn't work in Spark? > SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel > SQL, DBVisualizer.etc) > --- > > Key: SPARK-24570 > URL: https://issues.apache.org/jira/browse/SPARK-24570 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: t oo >Priority: Major > Attachments: connect-to-sql-db-ssms-locate-table.png > > > An end-user SQL client tool (ie in the screenshot) can list tables from > hiveserver2 and major DBs (Mysql, postgres,oracle, MSSQL..etc). But with > SparkSQL it does not display any tables. This would be very convenient for > users. > This is the exception in the client tool (Aqua Data Studio): > {code:java} > Title: An Error Occurred > Summary: Unable to Enumerate Result > Start Message > > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '`*`' expecting STRING(line 1, pos 38) > == SQL == > SHOW TABLE EXTENDED FROM sit1_pb LIKE `*` > --^^^ > End Message > > Start Stack Trace > > java.sql.SQLException: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '`*`' expecting STRING(line 1, pos 38) > == SQL == > SHOW TABLE EXTENDED FROM sit1_pb LIKE `*` > --^^^ > at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296) > at com.aquafold.aquacore.open.rdbms.drivers.hive.Qꐨꈬꈦꁐ.execute(Unknown > Source) > at \\.\\.\\हिñçêČάй語简�?한\\.gᚵ᠃᠍ꃰint.execute(Unknown Source) > at com.common.ui.tree.hꐊᠱꇗꇐ9int.yW(Unknown Source) > at com.common.ui.tree.hꐊᠱꇗꇐ9int$1.process(Unknown Source) > at com.common.ui.util.BackgroundThread.run(Unknown Source) > End Stack Trace > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
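For context on the parse error quoted above: the parser expects a STRING literal after LIKE, so the form Spark accepts uses a quoted pattern rather than backticks. A sketch through the SQL API, reusing the database name from the report:

{code:scala}
// The pattern is a string literal, not a backquoted identifier.
spark.sql("SHOW TABLE EXTENDED FROM sit1_pb LIKE '*'").show(truncate = false)
{code}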
[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523431#comment-16523431 ] Hyukjin Kwon commented on SPARK-24530: -- macOS, Python 2.7.14, Sphinx 1.4.1 shows: {code} class pyspark.ml.classification.LogisticRegression(*args, **kwargs)[source] Logistic regression. This class supports multinomial logistic (softmax) and binomial logistic regression. {code} > pyspark.ml doesn't generate class docs correctly > > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523433#comment-16523433 ] Hyukjin Kwon commented on SPARK-24530: -- I have another computer: macOS, Python 2.7.14, Sphinx 1.7.2 shows: {code} class pyspark.ml.classification.LogisticRegression(*args, **kwargs)[source] Logistic regression. This class supports multinomial logistic (softmax) and binomial logistic regression. {code} I think we need [~dongjoon]'s input. > pyspark.ml doesn't generate class docs correctly > > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child
[ https://issues.apache.org/jira/browse/SPARK-24458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523443#comment-16523443 ] Hyukjin Kwon commented on SPARK-24458: -- I usually just checkout on the tag, for example, {{git checkout v2.3.1}}, and build it .. but I usually just download minor versions and check it frankly because usually maintenance release doesn't have a big change. > Invalid PythonUDF check_1(), requires attributes from more than one child > - > > Key: SPARK-24458 > URL: https://issues.apache.org/jira/browse/SPARK-24458 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Spark 2.3.0 (local mode) > Mac OSX >Reporter: Abdeali Kothari >Priority: Major > > I was trying out a very large query execution plan I have and I got the error: > > {code:java} > py4j.protocol.Py4JJavaError: An error occurred while calling > o359.simpleString. > : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires > attributes from more than one child. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMeth
[jira] [Assigned] (SPARK-22425) add output files information to EventLogger
[ https://issues.apache.org/jira/browse/SPARK-22425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22425: Assignee: Apache Spark > add output files information to EventLogger > --- > > Key: SPARK-22425 > URL: https://issues.apache.org/jira/browse/SPARK-22425 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Long Tian >Assignee: Apache Spark >Priority: Major > Labels: patch > > We can get all the input files from *EventLogger* when > *spark.eventLog.enabled* is *true*. But there's no output files information. > Is it possible to add some output files information to *EventLogger*? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22425) add output files information to EventLogger
[ https://issues.apache.org/jira/browse/SPARK-22425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523459#comment-16523459 ] Apache Spark commented on SPARK-22425: -- User 'voidfunction' has created a pull request for this issue: https://github.com/apache/spark/pull/21642 > add output files information to EventLogger > --- > > Key: SPARK-22425 > URL: https://issues.apache.org/jira/browse/SPARK-22425 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Long Tian >Priority: Major > Labels: patch > > We can get all the input files from *EventLogger* when > *spark.eventLog.enabled* is *true*. But there's no output files information. > Is it possible to add some output files information to *EventLogger*? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22425) add output files information to EventLogger
[ https://issues.apache.org/jira/browse/SPARK-22425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22425: Assignee: (was: Apache Spark) > add output files information to EventLogger > --- > > Key: SPARK-22425 > URL: https://issues.apache.org/jira/browse/SPARK-22425 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Long Tian >Priority: Major > Labels: patch > > We can get all the input files from *EventLogger* when > *spark.eventLog.enabled* is *true*. But there's no output files information. > Is it possible to add some output files information to *EventLogger*? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child
[ https://issues.apache.org/jira/browse/SPARK-24458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523475#comment-16523475 ] Ruben Berenguel commented on SPARK-24458: - Oh, big facepalm, thanks [~hyukjin.kwon]. My autocomplete script completes branches but not tags, so could not find 2.3.0 and I've been so long using it I expect it to always work. I need to fix that :D. I'll check reproducibility and try to find the change that fixed this for completeness > Invalid PythonUDF check_1(), requires attributes from more than one child > - > > Key: SPARK-24458 > URL: https://issues.apache.org/jira/browse/SPARK-24458 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Spark 2.3.0 (local mode) > Mac OSX >Reporter: Abdeali Kothari >Priority: Major > > I was trying out a very large query execution plan I have and I got the error: > > {code:java} > py4j.protocol.Py4JJavaError: An error occurred while calling > o359.simpleString. > : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires > attributes from more than one child. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114) > at > 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187) > at sun.reflec
[jira] [Commented] (SPARK-24347) df.alias() in python API should not clear metadata by default
[ https://issues.apache.org/jira/browse/SPARK-24347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523476#comment-16523476 ] Ruben Berenguel commented on SPARK-24347: - Pinging [~hyukjin.kwon], too :) > df.alias() in python API should not clear metadata by default > - > > Key: SPARK-24347 > URL: https://issues.apache.org/jira/browse/SPARK-24347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Tomasz Bartczak >Priority: Minor > > currently when doing an alias on a column in pyspark I lose metadata: > {code:java} > print("just select = ", df.select(col("v")).schema.fields[0].metadata.keys()) > print("select alias= ", > df.select(col("v").alias("vv")).schema.fields[0].metadata.keys()){code} > gives: > {code:java} > just select = dict_keys(['ml_attr']) > select alias= dict_keys([]){code} > After looking at alias() documentation I see that metadata is an optional > param. But it should not clear the metadata when it is not set. A default > solution should be to keep it as-is. > Otherwise - it generates problems in a later part of the processing pipeline > when someone is depending on the metadata. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
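While the default behaviour is discussed, the Scala Column API has an overload that carries metadata through explicitly, which suggests a workaround of the same shape. A sketch, reusing the column names from the snippet above and assuming df is the DataFrame in question:

{code:scala}
import org.apache.spark.sql.functions.col

// Pass the existing metadata explicitly when aliasing so it is not dropped.
val vMetadata = df.schema("v").metadata
val aliased = df.select(col("v").as("vv", vMetadata))
aliased.schema("vv").metadata // still contains the original keys, e.g. ml_attr
{code}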
[jira] [Commented] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files
[ https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523479#comment-16523479 ] Andrei Gorlanov commented on SPARK-18649: - Hello, I am going to take care of it. > sc.textFile(my_file).collect() raises socket.timeout on large files > --- > > Key: SPARK-18649 > URL: https://issues.apache.org/jira/browse/SPARK-18649 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: PySpark version 1.6.2 >Reporter: Erik Cederstrand >Priority: Major > > I'm trying to load a file into the driver with this code: > contents = sc.textFile('hdfs://path/to/big_file.csv').collect() > Loading into the driver instead of creating a distributed RDD is intentional > in this case. The file is ca. 6GB, and I have adjusted driver memory > accordingly to fit the local data. After some time, my spark/submitted job > crashes with the stack trace below. > I have traced this to pyspark/rdd.py where the _load_from_socket() method > creates a socket with a hard-coded timeout of 3 seconds (this code is also > present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value > to e.g. 600 lets me read the entire file. > Is there any reason that this value does not use e.g. the > 'spark.network.timeout' setting instead? > Traceback (most recent call last): > File "my_textfile_test.py", line 119, in > contents = sc.textFile('hdfs://path/to/file.csv').collect() > File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 772, in collect > File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 142, in _load_from_socket > File > "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 517, in load_stream > File > "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 511, in loads > File "/usr/lib/python2.7/socket.py", line 380, in read > data = self._sock.recv(left) > socket.timeout: timed out > 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe > java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) > at java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > at java.io.DataOutputStream.flush(DataOutputStream.java:123) > at java.io.FilterOutputStream.close(FilterOutputStream.java:158) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248) > at > org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649) > Suppressed: java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) > at > java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at > java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > at java.io.FilterOutputStream.close(FilterOutputStream.java:158) > at java.io.FilterOutputStream.close(FilterOutputStream.java:159) > ... 
3 more > 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) > at java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.a
[jira] [Created] (SPARK-24659) GenericArrayData.equals should respect element type differences
Kris Mok created SPARK-24659: Summary: GenericArrayData.equals should respect element type differences Key: SPARK-24659 URL: https://issues.apache.org/jira/browse/SPARK-24659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.3.0, 2.4.0 Reporter: Kris Mok Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect element type differences, due to a caveat in Scala's {{==}} operator. e.g. {{new GenericArrayData(Array[Int](123)).equals(new GenericArrayData(Array[Long](123L)))}} currently returns true. But that's against the semantics of Spark SQL's array type, where {{array<int>}} and {{array<bigint>}} are considered to be incompatible types and thus should never be equal. This ticket proposes to fix the implementation of {{GenericArrayData.equals}} so that it's more aligned with Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
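The Scala caveat referred to is numeric equality across widths: == compares an Int and a Long by value, so same-valued elements of different element types compare equal. A tiny illustration of the behaviour the ticket describes:

{code:scala}
// Element-wise comparison with Scala's == hides the Int vs Long difference:
val intElems  = Array[Int](123)
val longElems = Array[Long](123L)
intElems(0) == longElems(0)  // true: 123 == 123L under Scala's numeric equality
// ...which is how the two GenericArrayData instances above can end up equal
// even though array<int> and array<bigint> are distinct SQL types.
{code}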
[jira] [Commented] (SPARK-24659) GenericArrayData.equals should respect element type differences
[ https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523578#comment-16523578 ] Apache Spark commented on SPARK-24659: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/21643 > GenericArrayData.equals should respect element type differences > --- > > Key: SPARK-24659 > URL: https://issues.apache.org/jira/browse/SPARK-24659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Kris Mok >Priority: Major > > Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect > element type differences, due to a caveat in Scala's {{==}} operator. > e.g. {{new GenericArrayData(Array[Int](123)).equals(new > GenericArrayData(Array[Long](123L)))}} currently returns true. But that's > against the semantics of Spark SQL's array type, where {{array}} and > {{array}} are considered to be incompatible types and thus should never > be equal. > This ticket proposes to fix the implementation of {{GenericArrayData.equals}} > so that it's more aligned to Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24659) GenericArrayData.equals should respect element type differences
[ https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24659: Assignee: (was: Apache Spark) > GenericArrayData.equals should respect element type differences > --- > > Key: SPARK-24659 > URL: https://issues.apache.org/jira/browse/SPARK-24659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Kris Mok >Priority: Major > > Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect > element type differences, due to a caveat in Scala's {{==}} operator. > e.g. {{new GenericArrayData(Array[Int](123)).equals(new > GenericArrayData(Array[Long](123L)))}} currently returns true. But that's > against the semantics of Spark SQL's array type, where {{array}} and > {{array}} are considered to be incompatible types and thus should never > be equal. > This ticket proposes to fix the implementation of {{GenericArrayData.equals}} > so that it's more aligned to Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24659) GenericArrayData.equals should respect element type differences
[ https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24659: Assignee: Apache Spark > GenericArrayData.equals should respect element type differences > --- > > Key: SPARK-24659 > URL: https://issues.apache.org/jira/browse/SPARK-24659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Major > > Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect > element type differences, due to a caveat in Scala's {{==}} operator. > e.g. {{new GenericArrayData(Array[Int](123)).equals(new > GenericArrayData(Array[Long](123L)))}} currently returns true. But that's > against the semantics of Spark SQL's array type, where {{array}} and > {{array}} are considered to be incompatible types and thus should never > be equal. > This ticket proposes to fix the implementation of {{GenericArrayData.equals}} > so that it's more aligned to Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24660) SHS is not showing properly errors when downloading logs
Marco Gaido created SPARK-24660: --- Summary: SHS is not showing properly errors when downloading logs Key: SPARK-24660 URL: https://issues.apache.org/jira/browse/SPARK-24660 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: Marco Gaido The History Server is not showing properly errors which happen when trying to download logs. In particular, when downloading logs for which the user is not authorized, the user sees a File not found error, instead of the unauthorized response. Similarly, trying to download logs from a non-existing application returns a server error, instead of a 404 message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24660) SHS is not showing properly errors when downloading logs
[ https://issues.apache.org/jira/browse/SPARK-24660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24660: Assignee: (was: Apache Spark) > SHS is not showing properly errors when downloading logs > > > Key: SPARK-24660 > URL: https://issues.apache.org/jira/browse/SPARK-24660 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The History Server is not showing properly errors which happen when trying to > download logs. In particular, when downloading logs for which the user is not > authorized, the user sees a File not found error, instead of the unauthorized > response. > Similarly, trying to download logs from a non-existing application returns a > server error, instead of a 404 message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24660) SHS is not showing properly errors when downloading logs
[ https://issues.apache.org/jira/browse/SPARK-24660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523788#comment-16523788 ] Apache Spark commented on SPARK-24660: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21644 > SHS is not showing properly errors when downloading logs > > > Key: SPARK-24660 > URL: https://issues.apache.org/jira/browse/SPARK-24660 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The History Server is not showing properly errors which happen when trying to > download logs. In particular, when downloading logs for which the user is not > authorized, the user sees a File not found error, instead of the unauthorized > response. > Similarly, trying to download logs from a non-existing application returns a > server error, instead of a 404 message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24660) SHS is not showing properly errors when downloading logs
[ https://issues.apache.org/jira/browse/SPARK-24660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24660: Assignee: Apache Spark > SHS is not showing properly errors when downloading logs > > > Key: SPARK-24660 > URL: https://issues.apache.org/jira/browse/SPARK-24660 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Major > > The History Server is not showing properly errors which happen when trying to > download logs. In particular, when downloading logs for which the user is not > authorized, the user sees a File not found error, instead of the unauthorized > response. > Similarly, trying to download logs from a non-existing application returns a > server error, instead of a 404 message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeException
David Mavashev created SPARK-24661: -- Summary: Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeException Key: SPARK-24661 URL: https://issues.apache.org/jira/browse/SPARK-24661 Project: Spark Issue Type: Bug Components: DStreams, Java API, PySpark Affects Versions: 2.3.0 Reporter: David Mavashev Steps to reproduce: Creating a data set: {code:java} List simpleWindowColumns = new ArrayList(); simpleWindowColumns.add("column1"); simpleWindowColumns.add("column2"); Map expressionsWithAliasesEntrySet = new HashMap); expressionsWithAliasesEntrySet.put("count(id)", "count_column"); DataFrameReader reader = sparkSession.read().format("csv"); Dataset sparkDataSet = reader.option("header", "true").load("/path/to/data/data.csv"); //Invoking cached: sparkDataSet = sparkDataSet.cached() //Creating window spec with 2 columns: WindowSpec window = Window.partitionBy(JavaConverters.asScalaIteratorConverter(simpleWindowColumns.stream().map(item->sparkDataSet.col(item)).iterator()).asScala().toSeq()); sparkDataSet = sparkDataSet.withColumns(JavaConverters.asScalaIteratorConverter(expressionsWithAliasesEntrySet.stream().map(item->item.getKey()).collect(Collectors.toList()).iterator()).asScala().toSeq(), JavaConverters.asScalaIteratorConverter(expressionsWithAliasesEntrySet.stream().map(item->new Column(item.getValue()).over(finalWindow)).collect(Collectors.toList()).iterator()).asScala().toSeq()); sparkDataSet.show();{code} Expected: Results are shown Actual: the following exception is thrown {code:java} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188) at org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:226) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPl
[jira] [Updated] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeException
[ https://issues.apache.org/jira/browse/SPARK-24661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mavashev updated SPARK-24661: --- Description: Steps to reproduce: Creating a data set:
{code:java}
import java.util.*;
import java.util.stream.Collectors;

import scala.collection.JavaConverters;

import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

List<String> simpleWindowColumns = new ArrayList<>();
simpleWindowColumns.add("column1");
simpleWindowColumns.add("column2");

Map<String, String> expressionsWithAliases = new HashMap<>();
expressionsWithAliases.put("count(id)", "count_column");

DataFrameReader reader = sparkSession.read().format("csv");

// Loading the data and invoking cache:
final Dataset<Row> sparkDataSet = reader.option("header", "true").load("/path/to/data/data.csv").cache();

// Creating a window spec partitioned by 2 columns:
final WindowSpec window = Window.partitionBy(JavaConverters.asScalaIteratorConverter(
    simpleWindowColumns.stream().map(name -> sparkDataSet.col(name)).iterator()).asScala().toSeq());

Dataset<Row> result = sparkDataSet.withColumns(
    JavaConverters.asScalaIteratorConverter(
        expressionsWithAliases.entrySet().stream().map(entry -> entry.getKey())
            .collect(Collectors.toList()).iterator()).asScala().toSeq(),
    JavaConverters.asScalaIteratorConverter(
        expressionsWithAliases.entrySet().stream().map(entry -> new Column(entry.getValue()).over(window))
            .collect(Collectors.toList()).iterator()).asScala().toSeq());

result.show();
{code}
Expected: results are shown. Actual: the following exception is thrown
{code:java}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188) at org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:226) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$ma
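A hedged Scala reduction of the SPARK-24661 scenario above, for anyone wanting a smaller reproduction: cache the input, then compute an aggregate over a window partitioned by two columns. The CSV path and the column names (column1, column2, id) are carried over from the Java snippet; everything else is an assumption, not the reporter's code.
{code:scala}
// Minimal sketch, assuming a CSV with a header and columns column1, column2 and id.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.read.option("header", "true").csv("/path/to/data/data.csv").cache()

// Window partitioned by two columns, as in the report.
val w = Window.partitionBy(col("column1"), col("column2"))

df.withColumn("count_column", count(col("id")).over(w)).show()
{code}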
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523888#comment-16523888 ] Hari Sekhon commented on SPARK-6305: Log4j 2.x would really help with Spark logging integration to ELK as there are a lot of things that just don't work properly in Log4j 1.x like layout.ConversionPattern for constructing JSON enriched logs, such as logging user and app names to distinguish jobs and provide much needed search usability. This is simply ignored in the SocketAppender in Log4j 1.x :-/ while SyslogAppender respects ConversionPattern but then splits all Java Exceptions in to multiple syslog logs so the JSON no longer parses and routes to the right indices for the Yarn queue, nor can you reassemble the exception logs using multiline codec at the other end as you'd end up with corrupted input streams from multiple loggers) :-/ Running Filebeats everywhere instead seems like overkill compared to being able to enable logging for debugging jobs on an ad-hoc basis to a Logstash sink that works using much better Log4j 2.x output appenders. I hope someone finally manages to sort this out as it's years overdue given Log4j 1.x was end of life 3 years ago and there is a big jump in capabilities between Log4j 1.x and 2.x, both in the number of appenders as well as completeness of even the old appenders such as the SocketAppender as mentioned above. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523888#comment-16523888 ] Hari Sekhon edited comment on SPARK-6305 at 6/26/18 3:47 PM: - Log4j 2.x would really help with Spark logging integration to ELK as there are a lot of things that just don't work properly in Log4j 1.x like layout.ConversionPattern for constructing JSON enriched logs, such as logging user and app names to distinguish jobs and provide much needed search usability. This is simply ignored in the SocketAppender in Log4j 1.x, while SyslogAppender respects ConversionPattern but then splits all Java Exceptions in to multiple syslog logs so the JSON no longer parses and routes to the right indices for the Yarn queue, nor can you reassemble the exception logs using multiline codec at the other end as you'd end up with corrupted input streams from multiple loggers) :-/ Running Filebeats everywhere instead seems like overkill compared to being able to enable logging for debugging jobs on an ad-hoc basis to a Logstash sink that works using much better Log4j 2.x output appenders. I hope someone finally manages to sort this out as it's years overdue given Log4j 1.x was end of life 3 years ago and there is a big jump in capabilities between Log4j 1.x and 2.x, both in the number of appenders as well as completeness of even the old appenders such as the SocketAppender as mentioned above. was (Author: harisekhon): Log4j 2.x would really help with Spark logging integration to ELK as there are a lot of things that just don't work properly in Log4j 1.x like layout.ConversionPattern for constructing JSON enriched logs, such as logging user and app names to distinguish jobs and provide much needed search usability. This is simply ignored in the SocketAppender in Log4j 1.x :-/ while SyslogAppender respects ConversionPattern but then splits all Java Exceptions in to multiple syslog logs so the JSON no longer parses and routes to the right indices for the Yarn queue, nor can you reassemble the exception logs using multiline codec at the other end as you'd end up with corrupted input streams from multiple loggers) :-/ Running Filebeats everywhere instead seems like overkill compared to being able to enable logging for debugging jobs on an ad-hoc basis to a Logstash sink that works using much better Log4j 2.x output appenders. I hope someone finally manages to sort this out as it's years overdue given Log4j 1.x was end of life 3 years ago and there is a big jump in capabilities between Log4j 1.x and 2.x, both in the number of appenders as well as completeness of even the old appenders such as the SocketAppender as mentioned above. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
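For context on what the 2.x API would add, below is a hedged sketch of enriching job logs with the user and application name via Log4j 2.x's ThreadContext. It assumes log4j-api and log4j-core 2.x on the driver classpath, which stock Spark builds do not provide while this ticket is open; a JsonLayout-based socket appender could then serialize these fields for Logstash.
{code:scala}
// Hedged sketch against the Log4j 2.x API; not something Spark ships today.
import org.apache.logging.log4j.{LogManager, ThreadContext}

// Fields the comment asks for: user and app name, attached to every log event on this thread.
ThreadContext.put("sparkUser", sys.props.getOrElse("user.name", "unknown"))
ThreadContext.put("appName", sc.appName)   // sc: an existing SparkContext

val log = LogManager.getLogger("job-audit")
log.info("job started")   // an appender configured with JsonLayout can emit the ThreadContext fields as JSON
{code}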
[jira] [Assigned] (SPARK-24653) Flaky test "JoinSuite.test SortMergeJoin (with spill)"
[ https://issues.apache.org/jira/browse/SPARK-24653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24653: Assignee: (was: Apache Spark) > Flaky test "JoinSuite.test SortMergeJoin (with spill)" > -- > > Key: SPARK-24653 > URL: https://issues.apache.org/jira/browse/SPARK-24653 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We've run into failures in this test in our internal jobs a few times. They > look like this: > {noformat} > java.lang.AssertionError: assertion failed: expected full outer join to not > spill, but did > at scala.Predef$.assert(Predef.scala:170) > at org.apache.spark.TestUtils$.assertNotSpilled(TestUtils.scala:189) > at > org.apache.spark.sql.JoinSuite$$anonfun$23$$anonfun$apply$mcV$sp$16.apply$mcV$sp(JoinSuite.scala:734) > at > org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:108) > {noformat} > I looked on the riselab jenkins and couldn't find a failure, so filing with a > low priority. > I did notice a possible race in the code that could explain the failure. Will > send a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24653) Flaky test "JoinSuite.test SortMergeJoin (with spill)"
[ https://issues.apache.org/jira/browse/SPARK-24653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24653: Assignee: Apache Spark > Flaky test "JoinSuite.test SortMergeJoin (with spill)" > -- > > Key: SPARK-24653 > URL: https://issues.apache.org/jira/browse/SPARK-24653 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > We've run into failures in this test in our internal jobs a few times. They > look like this: > {noformat} > java.lang.AssertionError: assertion failed: expected full outer join to not > spill, but did > at scala.Predef$.assert(Predef.scala:170) > at org.apache.spark.TestUtils$.assertNotSpilled(TestUtils.scala:189) > at > org.apache.spark.sql.JoinSuite$$anonfun$23$$anonfun$apply$mcV$sp$16.apply$mcV$sp(JoinSuite.scala:734) > at > org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:108) > {noformat} > I looked on the riselab jenkins and couldn't find a failure, so filing with a > low priority. > I did notice a possible race in the code that could explain the failure. Will > send a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523918#comment-16523918 ] Marcelo Vanzin commented on SPARK-24631: Sorry for the noise, pasted the wrong bug number in my PR. > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
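Not part of the report above, but a hedged diagnostic sketch for this class of error: the up-cast check usually fires when the schema recorded in the catalog disagrees with the schema Spark derives for the query, so comparing the two is a reasonable first step. The table name is the placeholder from the report.
{code:scala}
// Hedged diagnostic sketch only; it inspects schemas, it does not fix the failure.
spark.table("testtable").printSchema()
spark.sql("describe formatted testtable").show(100, false)
{code}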
[jira] [Assigned] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24631: -- Assignee: (was: Marcelo Vanzin) > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24631: -- Assignee: Marcelo Vanzin > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Assignee: Marcelo Vanzin >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24653) Flaky test "JoinSuite.test SortMergeJoin (with spill)"
[ https://issues.apache.org/jira/browse/SPARK-24653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523916#comment-16523916 ] Apache Spark commented on SPARK-24653: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21639 > Flaky test "JoinSuite.test SortMergeJoin (with spill)" > -- > > Key: SPARK-24653 > URL: https://issues.apache.org/jira/browse/SPARK-24653 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We've run into failures in this test in our internal jobs a few times. They > look like this: > {noformat} > java.lang.AssertionError: assertion failed: expected full outer join to not > spill, but did > at scala.Predef$.assert(Predef.scala:170) > at org.apache.spark.TestUtils$.assertNotSpilled(TestUtils.scala:189) > at > org.apache.spark.sql.JoinSuite$$anonfun$23$$anonfun$apply$mcV$sp$16.apply$mcV$sp(JoinSuite.scala:734) > at > org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:108) > {noformat} > I looked on the riselab jenkins and couldn't find a failure, so filing with a > low priority. > I did notice a possible race in the code that could explain the failure. Will > send a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24631: --- Comment: was deleted (was: User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21639) > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct
[ https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24537: Assignee: (was: Apache Spark) > Add array_remove / array_zip / map_from_arrays / array_distinct > --- > > Key: SPARK-24537 > URL: https://issues.apache.org/jira/browse/SPARK-24537 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of > * array_remove -SPARK-23920- > * array_zip -SPARK-23931- > * map_from_arrays -SPARK-23933- > * array_distinct -SPARK-23912- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct
[ https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24537: Assignee: Apache Spark > Add array_remove / array_zip / map_from_arrays / array_distinct > --- > > Key: SPARK-24537 > URL: https://issues.apache.org/jira/browse/SPARK-24537 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Add R versions of > * array_remove -SPARK-23920- > * array_zip -SPARK-23931- > * map_from_arrays -SPARK-23933- > * array_distinct -SPARK-23912- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct
[ https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524114#comment-16524114 ] Apache Spark commented on SPARK-24537: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21645 > Add array_remove / array_zip / map_from_arrays / array_distinct > --- > > Key: SPARK-24537 > URL: https://issues.apache.org/jira/browse/SPARK-24537 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of > * array_remove -SPARK-23920- > * array_zip -SPARK-23931- > * map_from_arrays -SPARK-23933- > * array_distinct -SPARK-23912- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24658) Remove workaround for ANTLR bug
[ https://issues.apache.org/jira/browse/SPARK-24658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24658. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 2.4.0 > Remove workaround for ANTLR bug > --- > > Key: SPARK-24658 > URL: https://issues.apache.org/jira/browse/SPARK-24658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Critical > Fix For: 2.4.0 > > > Issue [antlr/antlr4#781|https://github.com/antlr/antlr4/issues/781] has > already been fixed, so the workaround of extracting the pattern into a > separate rule is no longer needed. The presto already removed it: > https://github.com/prestodb/presto/pull/10744. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524233#comment-16524233 ] Dongjoon Hyun commented on SPARK-24530: --- [~mengxr] and [~hyukjin.kwon]. My environment is macOS, *python 3*, Sphinx v1.6.3. {code} ~/s/p/docs:master$ make html sphinx-build -b html -d _build/doctrees . _build/html Running Sphinx v1.6.3 making output directory... ... {code} According to the above reports, many combinations of Python 2.7 and Sphinx looks broken. > pyspark.ml doesn't generate class docs correctly > > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24530) pyspark.ml doesn't generate class docs correctly
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524233#comment-16524233 ] Dongjoon Hyun edited comment on SPARK-24530 at 6/26/18 9:49 PM: [~mengxr] and [~hyukjin.kwon]. My environment is macOS, *python 3*, Sphinx v1.6.3. {code} ~/s/p/docs:master$ make html sphinx-build -b html -d _build/doctrees . _build/html Running Sphinx v1.6.3 making output directory... ... {code} According to the above reports, many combinations of Python 2.7 and Sphinx looks broken? was (Author: dongjoon): [~mengxr] and [~hyukjin.kwon]. My environment is macOS, *python 3*, Sphinx v1.6.3. {code} ~/s/p/docs:master$ make html sphinx-build -b html -d _build/doctrees . _build/html Running Sphinx v1.6.3 making output directory... ... {code} According to the above reports, many combinations of Python 2.7 and Sphinx looks broken. > pyspark.ml doesn't generate class docs correctly > > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24423) Add a new option `query` for JDBC sources
[ https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24423. - Resolution: Fixed Assignee: Dilip Biswal Fix Version/s: 2.4.0 > Add a new option `query` for JDBC sources > - > > Key: SPARK-24423 > URL: https://issues.apache.org/jira/browse/SPARK-24423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.4.0 > > > Currently, our JDBC connector provides the option `dbtable` for users to > specify the to-be-loaded JDBC source table. > {code} > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", "dbName.tableName") > .options(jdbcCredentials: Map) > .load() > {code} > Normally, users do not fetch the whole JDBC table due to the poor > performance/throughput of JDBC. Thus, they normally just fetch a small set of > tables. For advanced users, they can pass a subquery as the option. > {code} > val query = """ (select * from tableName limit 10) as tmp """ > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", query) > .options(jdbcCredentials: Map) > .load() > {code} > However, this is straightforward to end users. We should simply allow users > to specify the query by a new option `query`. We will handle the complexity > for them. > {code} > val query = """select * from tableName limit 10""" > val jdbcDf = spark.read > .format("jdbc") > .option("*{color:#ff}query{color}*", query) > .options(jdbcCredentials: Map) > .load() > {code} > Users are not allowed to specify query and dbtable at the same time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
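A hedged usage sketch of the option this ticket adds (available from Spark 2.4 per the fix version); the URL and credentials are placeholders, not values from the ticket:
{code:scala}
// Sketch of the new `query` option; connection settings are hypothetical.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("query", "select * from tableName limit 10")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()
{code}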
[jira] [Created] (SPARK-24662) Structured Streaming should support LIMIT
Mukul Murthy created SPARK-24662: Summary: Structured Streaming should support LIMIT Key: SPARK-24662 URL: https://issues.apache.org/jira/browse/SPARK-24662 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Mukul Murthy Make structured streams support the LIMIT operator. This will undo SPARK-24525 as the limit operator would be a superior solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
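A hedged sketch of how the operator might be used once this lands; the rate source and memory sink are just convenient test endpoints, and the code is not valid on releases that predate the change:
{code:scala}
// Sketch of a bounded preview query, assuming LIMIT becomes supported as proposed here.
val preview = spark.readStream
  .format("rate")   // built-in test source emitting timestamp/value rows
  .load()
  .limit(10)        // the operator this ticket proposes to support

val query = preview.writeStream
  .format("memory")
  .queryName("preview")
  .outputMode("append")
  .start()
{code}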
[jira] [Resolved] (SPARK-6237) Support uploading blocks > 2GB as a stream
[ https://issues.apache.org/jira/browse/SPARK-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-6237. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21346 [https://github.com/apache/spark/pull/21346] > Support uploading blocks > 2GB as a stream > -- > > Key: SPARK-6237 > URL: https://issues.apache.org/jira/browse/SPARK-6237 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6237) Support uploading blocks > 2GB as a stream
[ https://issues.apache.org/jira/browse/SPARK-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-6237: - Assignee: Imran Rashid > Support uploading blocks > 2GB as a stream > -- > > Key: SPARK-6237 > URL: https://issues.apache.org/jira/browse/SPARK-6237 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24208) Cannot resolve column in self join after applying Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-24208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524341#comment-16524341 ] Stu (Michael Stewart) commented on SPARK-24208: --- [~hyukjin.kwon] I can confirm I ran into this issue too. The issue, as the OP noted, stems from having a pandas GROUPED_MAP UDF applied to a DF prior to attempting a self-join of said DF against itself. Beyond that I've not investigated. > Cannot resolve column in self join after applying Pandas UDF > > > Key: SPARK-24208 > URL: https://issues.apache.org/jira/browse/SPARK-24208 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: AWS EMR 5.13.0 > Amazon Hadoop distribution 2.8.3 > Spark 2.3.0 > Pandas 0.22.0 >Reporter: Rafal Ganczarek >Priority: Minor > > I noticed that after applying Pandas UDF function, a self join of resulted > DataFrame will fail to resolve columns. The workaround that I found is to > recreate DataFrame with its RDD and schema. > Below you can find a Python code that reproduces the issue. > {code:java} > from pyspark import Row > import pyspark.sql.functions as F > @F.pandas_udf('key long, col string', F.PandasUDFType.GROUPED_MAP) > def dummy_pandas_udf(df): > return df[['key','col']] > df = spark.createDataFrame([Row(key=1,col='A'), Row(key=1,col='B'), > Row(key=2,col='C')]) > # transformation that causes the issue > df = df.groupBy('key').apply(dummy_pandas_udf) > # WORKAROUND that fixes the issue > # df = spark.createDataFrame(df.rdd, df.schema) > df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == > F.col('temp1.key')).show() > {code} > If workaround line is commented out, then above code fails with the following > error: > {code:java} > AnalysisExceptionTraceback (most recent call last) > in () > 12 # df = spark.createDataFrame(df.rdd, df.schema) > 13 > ---> 14 df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == > F.col('temp1.key')).show() > /usr/lib/spark/python/pyspark/sql/dataframe.py in join(self, other, on, how) > 929 on = self._jseq([]) > 930 assert isinstance(how, basestring), "how should be > basestring" > --> 931 jdf = self._jdf.join(other._jdf, on, how) > 932 return DataFrame(jdf, self.sql_ctx) > 933 > /usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, > *args) >1158 answer = self.gateway_client.send_command(command) >1159 return_value = get_return_value( > -> 1160 answer, self.gateway_client, self.target_id, self.name) >1161 >1162 for temp_arg in temp_args: > /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 67 > e.java_exception.getStackTrace())) > 68 if s.startswith('org.apache.spark.sql.AnalysisException: > '): > ---> 69 raise AnalysisException(s.split(': ', 1)[1], > stackTrace) > 70 if s.startswith('org.apache.spark.sql.catalyst.analysis'): > 71 raise AnalysisException(s.split(': ', 1)[1], > stackTrace) > AnalysisException: u"cannot resolve '`temp0.key`' given input columns: > [temp0.key, temp0.col];;\n'Join Inner, ('temp0.key = 'temp1.key)\n:- > AnalysisBarrier\n: +- SubqueryAlias temp0\n:+- > FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), > [key#4104L, col#4105]\n: +- Project [key#4099L, col#4098, > key#4099L]\n: +- LogicalRDD [col#4098, key#4099L], false\n+- > AnalysisBarrier\n +- SubqueryAlias temp1\n +- > FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), > [key#4104L, col#4105]\n+- Project [key#4099L, col#4098, > key#4099L]\n +- LogicalRDD [col#4098, key#4099L], false\n" > 
{code} > The same happens, if instead of DataFrame API I use Spark SQL to do a self > join: > {code:java} > # df is a DataFrame after applying dummy_pandas_udf > df.createOrReplaceTempView('df') > spark.sql(''' > SELECT > * > FROM df temp0 > LEFT JOIN df temp1 ON > temp0.key == temp1.key > ''').show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"
Marcelo Vanzin created SPARK-24663: -- Summary: Flaky test: StreamingContextSuite "stop slow receiver gracefully" Key: SPARK-24663 URL: https://issues.apache.org/jira/browse/SPARK-24663 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.4.0 Reporter: Marcelo Vanzin This is another test that sometimes fails on our build machines, although I can't find failures on the riselab jenkins servers. Failure looks like: {noformat} org.scalatest.exceptions.TestFailedException: 0 was not greater than 0 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335) {noformat} The test fails in about 2s, while a successful run generally takes 15s. Looking at the logs, the receiver hasn't even started when things fail, which points at a race during test initialization. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perry Chu updated SPARK-24447: -- Priority: Minor (was: Major) > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Minor > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) > matrix = RowMatrix(rows) > sims = matrix.columnSimilarities() > ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) > print(sims.numRows(),sims.numCols()) > ## This throws an error (stack trace below) > print(sims.entries.first()) > ## Later I tried this > print(rows.context) # > print(sims.entries.context) # PySparkShell>, then throws an error{code} > Error stack trace > {code:java} > --- > AttributeError Traceback (most recent call last) > in () > > 1 sims.entries.first() > /usr/lib/spark/python/pyspark/rdd.py in first(self) > 1374 ValueError: RDD is empty > 1375 """ > -> 1376 rs = self.take(1) > 1377 if rs: > 1378 return rs[0] > /usr/lib/spark/python/pyspark/rdd.py in take(self, num) > 1356 > 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) > -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) > 1359 > 1360 items += res > /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, > partitions, allowLocal) > 999 # SparkContext#runJob. > 1000 mappedRDD = rdd.mapPartitions(partitionFunc) > -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > 1003 > AttributeError: 'NoneType' object has no attribute 'sc' > {code} > PySpark columnSimilarities documentation > http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perry Chu updated SPARK-24447: -- Description: The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the spark context if spark is stopped and restarted in pyspark. I'm pretty new to spark - not sure if the problem is on the python side or the scala side - would appreciate someone more experienced taking a look. This snippet should reproduce the error: {code:java} from pyspark.mllib.linalg.distributed import RowMatrix rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) matrix = RowMatrix(rows) sims = matrix.columnSimilarities() ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) print(sims.numRows(),sims.numCols()) ## This throws an error (stack trace below) print(sims.entries.first()) ## Later I tried this print(rows.context) # print(sims.entries.context) #, then throws an error{code} Error stack trace {code:java} --- AttributeError Traceback (most recent call last) in () > 1 sims.entries.first() /usr/lib/spark/python/pyspark/rdd.py in first(self) 1374 ValueError: RDD is empty 1375 """ -> 1376 rs = self.take(1) 1377 if rs: 1378 return rs[0] /usr/lib/spark/python/pyspark/rdd.py in take(self, num) 1356 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) 1359 1360 items += res /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal) 999 # SparkContext#runJob. 1000 mappedRDD = rdd.mapPartitions(partitionFunc) -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) 1003 AttributeError: 'NoneType' object has no attribute 'sc' {code} PySpark columnSimilarities documentation [http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities] was: The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the spark context. I'm pretty new to spark - not sure if the problem is on the python side or the scala side - would appreciate someone more experienced taking a look. This snippet should reproduce the error: {code:java} from pyspark.mllib.linalg.distributed import RowMatrix rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) matrix = RowMatrix(rows) sims = matrix.columnSimilarities() ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) print(sims.numRows(),sims.numCols()) ## This throws an error (stack trace below) print(sims.entries.first()) ## Later I tried this print(rows.context) # print(sims.entries.context) #, then throws an error{code} Error stack trace {code:java} --- AttributeError Traceback (most recent call last) in () > 1 sims.entries.first() /usr/lib/spark/python/pyspark/rdd.py in first(self) 1374 ValueError: RDD is empty 1375 """ -> 1376 rs = self.take(1) 1377 if rs: 1378 return rs[0] /usr/lib/spark/python/pyspark/rdd.py in take(self, num) 1356 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) 1359 1360 items += res /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal) 999 # SparkContext#runJob. 
1000 mappedRDD = rdd.mapPartitions(partitionFunc) -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) 1003 AttributeError: 'NoneType' object has no attribute 'sc' {code} PySpark columnSimilarities documentation http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Minor > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context if spark is stopped and restarted in pyspark. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = spark.spark
[jira] [Updated] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perry Chu updated SPARK-24447: -- Description: The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the spark context if spark is stopped and restarted in pyspark. I'm pretty new to spark - not sure if the problem is on the python side or the scala side - would appreciate someone more experienced taking a look. This snippet should reproduce the error: {code:java} import pyspark from pyspark.mllib.linalg.distributed import RowMatrix spark.stop() spark = pyspark.sql.SparkSession.builder.getOrCreate() rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) matrix = RowMatrix(rows) sims = matrix.columnSimilarities() ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) print(sims.numRows(),sims.numCols()) ## This throws an error (stack trace below) print(sims.entries.first()) ## Later I tried this print(rows.context) # print(sims.entries.context) #, then throws an error{code} Error stack trace {code:java} --- AttributeError Traceback (most recent call last) in () > 1 sims.entries.first() /usr/lib/spark/python/pyspark/rdd.py in first(self) 1374 ValueError: RDD is empty 1375 """ -> 1376 rs = self.take(1) 1377 if rs: 1378 return rs[0] /usr/lib/spark/python/pyspark/rdd.py in take(self, num) 1356 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) 1359 1360 items += res /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal) 999 # SparkContext#runJob. 1000 mappedRDD = rdd.mapPartitions(partitionFunc) -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) 1003 AttributeError: 'NoneType' object has no attribute 'sc' {code} PySpark columnSimilarities documentation [http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities] was: The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the spark context if spark is stopped and restarted in pyspark. I'm pretty new to spark - not sure if the problem is on the python side or the scala side - would appreciate someone more experienced taking a look. 
This snippet should reproduce the error: {code:java} from pyspark.mllib.linalg.distributed import RowMatrix rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) matrix = RowMatrix(rows) sims = matrix.columnSimilarities() ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) print(sims.numRows(),sims.numCols()) ## This throws an error (stack trace below) print(sims.entries.first()) ## Later I tried this print(rows.context) # print(sims.entries.context) #, then throws an error{code} Error stack trace {code:java} --- AttributeError Traceback (most recent call last) in () > 1 sims.entries.first() /usr/lib/spark/python/pyspark/rdd.py in first(self) 1374 ValueError: RDD is empty 1375 """ -> 1376 rs = self.take(1) 1377 if rs: 1378 return rs[0] /usr/lib/spark/python/pyspark/rdd.py in take(self, num) 1356 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) 1359 1360 items += res /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal) 999 # SparkContext#runJob. 1000 mappedRDD = rdd.mapPartitions(partitionFunc) -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) 1003 AttributeError: 'NoneType' object has no attribute 'sc' {code} PySpark columnSimilarities documentation [http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities] > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Minor > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context if spark is stopped and restarted in pyspark. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > T
[jira] [Resolved] (SPARK-24659) GenericArrayData.equals should respect element type differences
[ https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24659. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21643 [https://github.com/apache/spark/pull/21643] > GenericArrayData.equals should respect element type differences > --- > > Key: SPARK-24659 > URL: https://issues.apache.org/jira/browse/SPARK-24659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 2.4.0 > > > Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect > element type differences, due to a caveat in Scala's {{==}} operator. > e.g. {{new GenericArrayData(Array[Int](123)).equals(new > GenericArrayData(Array[Long](123L)))}} currently returns true. But that's > against the semantics of Spark SQL's array type, where {{array}} and > {{array}} are considered to be incompatible types and thus should never > be equal. > This ticket proposes to fix the implementation of {{GenericArrayData.equals}} > so that it's more aligned to Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24659) GenericArrayData.equals should respect element type differences
[ https://issues.apache.org/jira/browse/SPARK-24659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24659: --- Assignee: Kris Mok > GenericArrayData.equals should respect element type differences > --- > > Key: SPARK-24659 > URL: https://issues.apache.org/jira/browse/SPARK-24659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 2.4.0 > > > Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect > element type differences, due to a caveat in Scala's {{==}} operator. > e.g. {{new GenericArrayData(Array[Int](123)).equals(new > GenericArrayData(Array[Long](123L)))}} currently returns true. But that's > against the semantics of Spark SQL's array type, where {{array}} and > {{array}} are considered to be incompatible types and thus should never > be equal. > This ticket proposes to fix the implementation of {{GenericArrayData.equals}} > so that it's more aligned to Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
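The Scala caveat the description refers to is easy to see in a REPL; a minimal illustration of why element types leak through {{==}} but not through {{equals}}:
{code:scala}
// Scala's == applies cooperative numeric equality, while equals follows Java semantics.
val a: Any = 123
val b: Any = 123L

a == b        // true  (BoxesRunTime cooperative equality across Int/Long)
a.equals(b)   // false (java.lang.Integer.equals(java.lang.Long))
{code}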
[jira] [Commented] (SPARK-23014) Migrate MemorySink fully to v2
[ https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524481#comment-16524481 ] Richard Yu commented on SPARK-23014: Hi [~joseph.torres], are you still working on this PR? It seems there has been no progress on it for a while now. > Migrate MemorySink fully to v2 > -- > > Key: SPARK-23014 > URL: https://issues.apache.org/jira/browse/SPARK-23014 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > There's already a MemorySinkV2, but its use is controlled by a flag. We need > to remove the V1 sink and always use the V2 sink. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24664) Column support name getter
zhengruifeng created SPARK-24664: Summary: Column support name getter Key: SPARK-24664 URL: https://issues.apache.org/jira/browse/SPARK-24664 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: zhengruifeng In SPARK-24557 (https://github.com/apache/spark/pull/21563), we found that it would be convenient if Column supported a name getter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
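For context, a rough PySpark sketch of the workaround people use today; {{_jc}} is an internal accessor used here only for illustration, and the comment about a future getter is only a guess at the proposal, not its actual API:
{code:python}
from pyspark.sql import functions as F

c = F.col("user_id").alias("uid")

# Today there is no public name getter, so callers either parse str(c) or
# reach into the JVM Column via the internal _jc accessor.
print(str(c))
print(c._jc.toString())

# A dedicated getter on Column (exact name to be decided in the proposal)
# would return the column's name directly, without string parsing.
{code}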
[jira] [Assigned] (SPARK-24605) size(null) should return null
[ https://issues.apache.org/jira/browse/SPARK-24605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24605: --- Assignee: Maxim Gekk > size(null) should return null > - > > Key: SPARK-24605 > URL: https://issues.apache.org/jira/browse/SPARK-24605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The default behavior size(null) == -1 is a big problem for several reasons: > # It is inconsistent with how SQL functions handle nulls. > # It is an extreme violation of [the Principle of Least > Astonishment|https://en.wikipedia.org/wiki/Principle_of_least_astonishment] > (POLA) > # It is not called out anywhere in the Spark docs or even [the Hive > docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > # It can lead to subtle bugs in analytics. > For example, our client discovered this behavior while investigating > post-click user engagement in their AdTech system. The schema was per ad > placement and post-click user engagements were in an array of structs. The > culprit was > df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), > ...), which subtracted 1 for every click without post-click engagement. > Luckily, the behavior led to negative engagement counts in some periods, > which alerted them to the problem and this bizarre behavior. > Current behavior Spark inherited from Hive. The most consistent behavior, > ignoring the insanity that Hive created in the first place, is for size(null) > to behave as length(null), which returns null. This handles the aggregation > case with sum/avg, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24605) size(null) should return null
[ https://issues.apache.org/jira/browse/SPARK-24605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24605. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21598 [https://github.com/apache/spark/pull/21598] > size(null) should return null > - > > Key: SPARK-24605 > URL: https://issues.apache.org/jira/browse/SPARK-24605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > The default behavior size(null) == -1 is a big problem for several reasons: > # It is inconsistent with how SQL functions handle nulls. > # It is an extreme violation of [the Principle of Least > Astonishment|https://en.wikipedia.org/wiki/Principle_of_least_astonishment] > (POLA) > # It is not called out anywhere in the Spark docs or even [the Hive > docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > # It can lead to subtle bugs in analytics. > For example, our client discovered this behavior while investigating > post-click user engagement in their AdTech system. The schema was per ad > placement and post-click user engagements were in an array of structs. The > culprit was > df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), > ...), which subtracted 1 for every click without post-click engagement. > Luckily, the behavior led to negative engagement counts in some periods, > which alerted them to the problem and this bizarre behavior. > Current behavior Spark inherited from Hive. The most consistent behavior, > ignoring the insanity that Hive created in the first place, is for size(null) > to behave as length(null), which returns null. This handles the aggregation > case with sum/avg, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
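To make the aggregation pitfall described above concrete, a minimal PySpark sketch (assuming an active session named {{spark}}; the data is made up and simplified to an array of strings):
{code:python}
from pyspark.sql import Row, functions as F

df = spark.createDataFrame([
    Row(placementId=1, engagements=["view", "signup"]),
    Row(placementId=1, engagements=None),  # a click with no post-click engagement
])

df.groupBy("placementId").agg(
    # With size(null) == -1, the null row silently subtracts 1: 2 + (-1) = 1.
    F.sum(F.size("engagements")).alias("engagement_count_buggy"),
    # Defensive variant under the current behavior: treat a null array as empty, giving 2.
    F.sum(F.when(F.col("engagements").isNull(), 0)
           .otherwise(F.size("engagements"))).alias("engagement_count"),
).show()
{code}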
[jira] [Assigned] (SPARK-23927) High-order function: sequence
[ https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23927: - Assignee: Alex Vayda > High-order function: sequence > - > > Key: SPARK-23927 > URL: https://issues.apache.org/jira/browse/SPARK-23927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Alex Vayda >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > * sequence(start, stop) → array > Generate a sequence of integers from start to stop, incrementing by 1 if > start is less than or equal to stop, otherwise -1. > * sequence(start, stop, step) → array > Generate a sequence of integers from start to stop, incrementing by step. > * sequence(start, stop) → array > Generate a sequence of dates from start date to stop date, incrementing by 1 > day if start date is less than or equal to stop date, otherwise -1 day. > * sequence(start, stop, step) → array > Generate a sequence of dates from start to stop, incrementing by step. The > type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. > * sequence(start, stop, step) → array > Generate a sequence of timestamps from start to stop, incrementing by step. > The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO > MONTH. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23927) High-order function: sequence
[ https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23927. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21155 [https://github.com/apache/spark/pull/21155] > High-order function: sequence > - > > Key: SPARK-23927 > URL: https://issues.apache.org/jira/browse/SPARK-23927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Alex Vayda >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > * sequence(start, stop) → array > Generate a sequence of integers from start to stop, incrementing by 1 if > start is less than or equal to stop, otherwise -1. > * sequence(start, stop, step) → array > Generate a sequence of integers from start to stop, incrementing by step. > * sequence(start, stop) → array > Generate a sequence of dates from start date to stop date, incrementing by 1 > day if start date is less than or equal to stop date, otherwise -1 day. > * sequence(start, stop, step) → array > Generate a sequence of dates from start to stop, incrementing by step. The > type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. > * sequence(start, stop, step) → array > Generate a sequence of timestamps from start to stop, incrementing by step. > The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO > MONTH. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
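For reference, a short PySpark sketch of how the new function can be exercised once a build containing it is available (assuming an active session named {{spark}}; expected results in the comments follow the Presto-style semantics quoted above):
{code:python}
# Ascending and descending integer sequences.
spark.sql("SELECT sequence(1, 5)").show(truncate=False)    # [1, 2, 3, 4, 5]
spark.sql("SELECT sequence(5, 1)").show(truncate=False)    # [5, 4, 3, 2, 1]

# Date sequence with an explicit interval step.
spark.sql(
    "SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month)"
).show(truncate=False)                                     # [2018-01-01, 2018-02-01, 2018-03-01]
{code}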
[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524510#comment-16524510 ] Xiangrui Meng commented on SPARK-24530: --- Confirmed that macOS, python 3, and Sphinx v1.6.6 can produce correct doc on my machine. I didn't find any reports on Sphinx github. So if we could make a minimal reproducible example, we should report the issue to Sphinx. On our side, we should update the release procedure doc to use Python 3 to generate docs. We should also update the official docs that are broken (2.1.2, 2.2.1, 2.3.1). [~hyukjin.kwon] Do you have time to take this ticket? (feel free to say no if you are busy:) cc: [~smilegator] > pyspark.ml doesn't generate class docs correctly > > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
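A quick sanity check of the doc-build environment along the lines of the comment above (assumes Sphinx is importable from the interpreter that will run the build):
{code:python}
import sys
import sphinx

# Python 3 with Sphinx 1.6.6 was the combination confirmed to render
# the pyspark.ml class docs correctly.
print(sys.version.split()[0])
print(sphinx.__version__)
{code}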
[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (using Python 2?)
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24530: -- Summary: Sphinx doesn't render autodoc_docstring_signature correctly (using Python 2?) (was: pyspark.ml doesn't generate class docs correctly) > Sphinx doesn't render autodoc_docstring_signature correctly (using Python 2?) > - > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24530: -- Summary: Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken (was: Sphinx doesn't render autodoc_docstring_signature correctly (using Python 2?)) > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23927) High-order function: sequence
[ https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524519#comment-16524519 ] Apache Spark commented on SPARK-23927: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21646 > High-order function: sequence > - > > Key: SPARK-23927 > URL: https://issues.apache.org/jira/browse/SPARK-23927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Alex Vayda >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > * sequence(start, stop) → array > Generate a sequence of integers from start to stop, incrementing by 1 if > start is less than or equal to stop, otherwise -1. > * sequence(start, stop, step) → array > Generate a sequence of integers from start to stop, incrementing by step. > * sequence(start, stop) → array > Generate a sequence of dates from start date to stop date, incrementing by 1 > day if start date is less than or equal to stop date, otherwise -1 day. > * sequence(start, stop, step) → array > Generate a sequence of dates from start to stop, incrementing by step. The > type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. > * sequence(start, stop, step) → array > Generate a sequence of timestamps from start to stop, incrementing by step. > The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO > MONTH. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21335) support un-aliased subquery
[ https://issues.apache.org/jira/browse/SPARK-21335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524527#comment-16524527 ] Apache Spark commented on SPARK-21335: -- User 'cnZach' has created a pull request for this issue: https://github.com/apache/spark/pull/21647 > support un-aliased subquery > --- > > Key: SPARK-21335 > URL: https://issues.apache.org/jira/browse/SPARK-21335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: release-notes > Fix For: 2.3.0 > > > Un-aliased subqueries have been supported by Spark SQL for a long time. Their semantics > were not well defined and had confusing behaviors, and they are not standard SQL > syntax, so we disallowed them in > https://issues.apache.org/jira/browse/SPARK-20690 . > However, this is a breaking change, and we do have existing queries using > un-aliased subqueries. We should add the support back and fix the semantics. > After the fix, there is no syntax change from branch 2.2 to master, but we > invalidate a weird use case: > {{SELECT v.i from (SELECT i FROM v)}}. Now this query will throw an analysis > exception because the outer query should not be able to reference a qualifier that is > only visible inside the subquery. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
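To illustrate the behavior described above from PySpark (the temp view {{v}} with a column {{i}} is created here just for the example):
{code:python}
spark.range(3).selectExpr("id AS i").createOrReplaceTempView("v")

# Un-aliased subqueries are accepted again; no alias is needed here.
spark.sql("SELECT i FROM (SELECT i FROM v)").show()

# The case that is now rejected: referencing the inner qualifier from the
# outer query. Uncommenting this should raise an AnalysisException.
# spark.sql("SELECT v.i FROM (SELECT i FROM v)").show()
{code}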
[jira] [Comment Edited] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524544#comment-16524544 ] Hyukjin Kwon edited comment on SPARK-24530 at 6/27/18 4:01 AM: --- Will take a look on this weekends. Please go ahead if anyone finds some time till then :-). was (Author: hyukjin.kwon): Will take a look on the weekends. Please go ahead if anyone finds some time till then :-). > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524544#comment-16524544 ] Hyukjin Kwon commented on SPARK-24530: -- Will take a look on the weekends. Please go ahead if anyone finds some time till then :-). > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524554#comment-16524554 ] Xiao Li commented on SPARK-24530: - [~hyukjin.kwon] Thanks for helping this! > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column
[ https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524590#comment-16524590 ] Reynold Xin commented on SPARK-24642: - Do we want this as an aggregate function? I'm thinking it's better to just take a string and infer the schema from that string. How would the query you provide compile if it is an aggregate function? > Add a function which infers schema from a JSON column > - > > Key: SPARK-24642 > URL: https://issues.apache.org/jira/browse/SPARK-24642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > We need to add a new aggregate function - *infer_schema()*. The function should > infer a schema for a set of JSON strings. The result of the function is a schema > in DDL format (or JSON format). > One of the use cases is passing output of *infer_schema()* to *from_json()*. > Currently, the from_json() function requires a schema as a mandatory > argument. It is possible to infer schema programmatically in Scala/Python and > pass it as the second argument but in SQL it is not possible. A user has to > pass the schema as a string literal in SQL. The new function should allow using it > in SQL, as in the example: > {code:sql} > select from_json(json_col, infer_schema(json_col)) > from json_table; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
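For comparison, a sketch of the programmatic workaround that is possible today from PySpark but not from pure SQL (the table {{json_table}} and column {{json_col}} follow the example above and are assumed to exist):
{code:python}
from pyspark.sql import functions as F

json_strings = spark.table("json_table").select("json_col")

# Infer the schema once on the driver from the JSON strings themselves...
inferred_schema = spark.read.json(json_strings.rdd.map(lambda r: r.json_col)).schema

# ...then pass it explicitly to from_json(), which is the step a SQL-only
# user cannot do today and infer_schema() would make unnecessary.
parsed = spark.table("json_table").select(
    F.from_json("json_col", inferred_schema).alias("parsed"))
{code}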
[jira] [Created] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs
Li Yuanjian created SPARK-24665: --- Summary: Add SQLConf in PySpark to manage all sql configs Key: SPARK-24665 URL: https://issues.apache.org/jira/browse/SPARK-24665 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Reporter: Li Yuanjian As new configs are added in PySpark, we currently read them by hard-coding the config name and default value. We should move all the configs into a class, like what we did with SQLConf in Spark SQL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
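A minimal sketch of the kind of centralization being proposed; the class and method names below are hypothetical, not the actual design, and the config key is just one example that PySpark currently reads by hand:
{code:python}
class PySparkSQLConf(object):
    """Single place that knows the SQL config keys and defaults used from PySpark."""

    # (key, default) pairs, instead of hard-coding them at every call site.
    ARROW_ENABLED = ("spark.sql.execution.arrow.enabled", "false")

    def __init__(self, spark):
        self._conf = spark.conf

    def arrow_enabled(self):
        key, default = self.ARROW_ENABLED
        return self._conf.get(key, default).lower() == "true"


# Usage, assuming an active SparkSession named `spark`:
# conf = PySparkSQLConf(spark)
# if conf.arrow_enabled():
#     ...
{code}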
[jira] [Assigned] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs
[ https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24665: Assignee: Apache Spark > Add SQLConf in PySpark to manage all sql configs > > > Key: SPARK-24665 > URL: https://issues.apache.org/jira/browse/SPARK-24665 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Assignee: Apache Spark >Priority: Major > > With new config adding in PySpark, we currently get them by hard coding the > config name and default value. We should move all the configs into a Class > like what we did in Spark SQL Conf. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs
[ https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524603#comment-16524603 ] Apache Spark commented on SPARK-24665: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/21648 > Add SQLConf in PySpark to manage all sql configs > > > Key: SPARK-24665 > URL: https://issues.apache.org/jira/browse/SPARK-24665 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Priority: Major > > With new config adding in PySpark, we currently get them by hard coding the > config name and default value. We should move all the configs into a Class > like what we did in Spark SQL Conf. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs
[ https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24665: Assignee: (was: Apache Spark) > Add SQLConf in PySpark to manage all sql configs > > > Key: SPARK-24665 > URL: https://issues.apache.org/jira/browse/SPARK-24665 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Priority: Major > > With new config adding in PySpark, we currently get them by hard coding the > config name and default value. We should move all the configs into a Class > like what we did in Spark SQL Conf. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23102) Migrate kafka sink
[ https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524615#comment-16524615 ] Richard Yu commented on SPARK-23102: Just a question: I have noted that ```KafkaStreamWriter.scala``` is already an implementation of a DataSourceV2 sink. So are we going to continue to evolve that class, or are we sticking with ```KafkaWriter```? > Migrate kafka sink > -- > > Key: SPARK-23102 > URL: https://issues.apache.org/jira/browse/SPARK-23102 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23102) Migrate kafka sink
[ https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524615#comment-16524615 ] Richard Yu edited comment on SPARK-23102 at 6/27/18 6:03 AM: - [~joseph.torres] Just a question: I have noted that ```KafkaStreamWriter.scala``` is already an implementation of a DataSourceV2 sink. So are we going to continue to evolve that class, or are we sticking with ```KafkaWriter```? was (Author: yohan123): Just a question: I have noted that ```KafkaStreamWriter.scala``` is already an implementation of a DataSourceV2 sink. So are we going to continue to evolve that class, or are we sticking with ```KafkaWriter```? > Migrate kafka sink > -- > > Key: SPARK-23102 > URL: https://issues.apache.org/jira/browse/SPARK-23102 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23102) Migrate kafka sink
[ https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524615#comment-16524615 ] Richard Yu edited comment on SPARK-23102 at 6/27/18 6:04 AM: - [~joseph.torres] Just a question: I have noted that {{KafkaStreamWriter}} is already an implementation of a DataSourceV2 sink. So are we going to continue to evolve that class, or are we sticking with {{KafkaWriter}}? was (Author: yohan123): [~joseph.torres] Just a question: I have noted that ```KafkaStreamWriter.scala``` is already an implementation of a DataSourceV2 sink. So are we going to continue to evolve that class, or are we sticking with ```KafkaWriter```? > Migrate kafka sink > -- > > Key: SPARK-23102 > URL: https://issues.apache.org/jira/browse/SPARK-23102 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-23102) Migrate kafka sink
[ https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Yu updated SPARK-23102: --- Comment: was deleted (was: [~joseph.torres] Just a question: I have noted that {{KafkaStreamWriter}} is already an implementation of a DataSourceV2 sink. So are we going to continue to evolve that class, or are we sticking with {{KafkaWriter}}? ) > Migrate kafka sink > -- > > Key: SPARK-23102 > URL: https://issues.apache.org/jira/browse/SPARK-23102 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23102) Migrate kafka sink
[ https://issues.apache.org/jira/browse/SPARK-23102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524628#comment-16524628 ] Richard Yu commented on SPARK-23102: Hi [~joseph.torres] Mind if I take this JIRA? > Migrate kafka sink > -- > > Key: SPARK-23102 > URL: https://issues.apache.org/jira/browse/SPARK-23102 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org