[jira] [Updated] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29776: - Description: As per rpad definition rpad rpad(str, len, pad) - Returns str, right-padded with pad to a length of len If str is longer than len, the return value is shortened to len characters. *In case of empty pad string, the return value is null.* Below is Example In Spark: {code} 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); ++ | rpad(hi, 5, ) | ++ | hi | ++ {code} It should return NULL as per definition. Hive behavior is correct as per definition it returns NULL when pad is empty String INFO : Concurrency mode is disabled, not creating a lock manager {code} +---+ | _c0 | +---+ | NULL | +---+ {code} was: As per rpad definition rpad rpad(str, len, pad) - Returns str, right-padded with pad to a length of len If str is longer than len, the return value is shortened to len characters. *In case of empty pad string, the return value is null.* Below is Example In Spark: 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); ++ | rpad(hi, 5, ) | ++ | hi | ++ It should return NULL as per definition. Hive behavior is correct as per definition it returns NULL when pad is empty String INFO : Concurrency mode is disabled, not creating a lock manager +---+ | _c0 | +---+ | NULL | +---+ > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. 
> *In case of empty pad string, the return value is null.*
> Below is an example in Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> +----------------+
> | rpad(hi, 5, )  |
> +----------------+
> | hi             |
> +----------------+
> {code}
> It should return NULL as per the definition.
>
> Hive behavior is correct: as per the definition, it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +-------+
> | _c0   |
> +-------+
> | NULL  |
> +-------+
> {code}
>
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
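The rpad semantics being debated above can be sketched in plain Python. This is an illustrative model of the *documented* behavior (empty pad yields NULL), not Spark's actual implementation:

```python
def rpad(s, length, pad):
    """Model of SQL rpad(str, len, pad) per the definition quoted in the issue.

    If str is longer than len, the result is truncated to len characters;
    an empty pad yields NULL (None here). Spark 3.0 instead returned the
    input string unchanged for an empty pad, which is the reported bug.
    """
    if s is None or pad is None:
        return None
    if len(s) >= length:
        return s[:length]
    if pad == "":
        return None  # the case Spark got wrong: it returned s unchanged
    needed = length - len(s)
    return s + (pad * needed)[:needed]
```

For example, `rpad('hi', 5, '')` is `None` under these semantics, matching the Hive output quoted above, while `rpad('hi', 5, '?')` pads out to `'hi???'`.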
[jira] [Resolved] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29784. -- Resolution: Invalid > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: image-2019-11-07-22-06-48-156.png > > > SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 > and 2.3.2 is returning after leading and trailing character removed. > Spark 3.0 – Not correct > {code} > jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+ > | trim(SL, SSparkSQLS) | > +---+ > | | > +--- > {code} > Spark 2.4 – Correct > {code} > jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+--+ > | trim(SSparkSQLS, SL) | > +---+--+ > | parkSQ | > +---+--+ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29784: - Description: SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 and 2.3.2 is returning after leading and trailing character removed. Spark 3.0 – Not correct {code} jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+ | trim(SL, SSparkSQLS) | +---+ | | +--- {code} Spark 2.4 – Correct {code} jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+--+ | trim(SSparkSQLS, SL) | +---+--+ | parkSQ | +---+--+ {code} was: SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 and 2.3.2 is returning after leading and trailing character removed. Spark 3.0 – Not correct jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+ | trim(SL, SSparkSQLS) | +---+ | | +--- Spark 2.4 – Correct jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+--+ | trim(SSparkSQLS, SL) | +---+--+ | parkSQ | +---+--+ > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: image-2019-11-07-22-06-48-156.png > > > SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 > and 2.3.2 is returning after leading and trailing character removed. 
> Spark 3.0 – Not correct
> {code}
> jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +-----------------------+
> | trim(SL, SSparkSQLS)  |
> +-----------------------+
> |                       |
> +-----------------------+
> {code}
> Spark 2.4 – Correct
> {code}
> jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +-----------------------+
> | trim(SSparkSQLS, SL)  |
> +-----------------------+
> | parkSQ                |
> +-----------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
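For reference, the 2.4 result shown above is a character-*set* trim (any leading/trailing character that appears in the trim string is removed), which is exactly what Python's `str.strip` does. A minimal plain-Python sketch, not Spark code:

```python
def trim_chars(src, trim_set):
    # Removes any leading and trailing characters that appear in trim_set,
    # matching the Spark 2.4 output quoted above: 'SSparkSQLS' trimmed with
    # the set {'S', 'L'} leaves 'parkSQ'. It is a set trim, not a substring
    # trim, which is why the inner 'S' of 'parkSQ...' survives.
    return src.strip(trim_set)
```

Note the argument-order difference visible in the quoted headers: 2.4 prints `trim(SSparkSQLS, SL)` while 3.0 prints `trim(SL, SSparkSQLS)`, which is the compatibility question this issue raises.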
[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969914#comment-16969914 ] Hyukjin Kwon commented on SPARK-29797: -- Not sure. How should they be read into DataFrame? > Read key-value metadata in Parquet files written by Apache Arrow > > > Key: SPARK-29797 > URL: https://issues.apache.org/jira/browse/SPARK-29797 > Project: Spark > Issue Type: New Feature > Components: Java API, PySpark >Affects Versions: 2.4.4 > Environment: Apache Arrow 0.14.1 built on Windows x86. > >Reporter: Isaac Myers >Priority: Major > Labels: features > Attachments: minimal_working_example.cpp > > > Key-value (user) metadata written to Parquet file from Apache Arrow c++ is > not readable in Spark (PySpark or Java API). I can only find field-level > metadata dictionaries in the schema and no other functions in the API that > indicate the presence of file-level key-value metadata. The attached code > demonstrates creation and retrieval of file-level metadata using the Apache > Arrow API. > {code:java} > #include #include #include #include #include > #include #include > #include #include > //#include > int main(int argc, char* argv[]){ /* Create > Parquet File **/ arrow::Status st; > arrow::MemoryPool* pool = arrow::default_memory_pool(); > // Create Schema and fields with metadata > std::vector> fields; > std::unordered_map a_keyval; a_keyval["unit"] = > "sec"; a_keyval["note"] = "not the standard millisecond unit"; > arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field > = arrow::field("a", arrow::int16(), false, a_md.Copy()); > fields.push_back(a_field); > std::unordered_map b_keyval; b_keyval["unit"] = > "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr > b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); > fields.push_back(b_field); > std::shared_ptr schema = arrow::schema(fields); > // Add metadata to schema. 
std::unordered_map > schema_keyval; schema_keyval["classification"] = "Type 0"; > arrow::KeyValueMetadata schema_md(schema_keyval); schema = > schema->AddMetadata(schema_md.Copy()); > // Build arrays of data and add to Table. const int64_t rowgroup_size = 100; > std::vector a_data(rowgroup_size, 0); std::vector > b_data(rowgroup_size, 0); > for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = > rowgroup_size - i; } arrow::Int16Builder a_bldr(pool); arrow::Int16Builder > b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = > b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; > st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1; > st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1; > std::shared_ptr a_arr_ptr; std::shared_ptr > b_arr_ptr; > arrow::ArrayVector arr_vec; st = a_bldr.Finish(_arr_ptr); if (!st.ok()) > return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(_arr_ptr); if > (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr); > std::shared_ptr table = arrow::Table::Make(schema, arr_vec); > // Test metadata printf("\nMetadata from original schema:\n"); > printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", > schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", > schema->field(1)->metadata()->ToString().c_str()); > std::shared_ptr table_schema = table->schema(); > printf("\nMetadata from schema retrieved from table (should be the > same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str()); > // Open file and write table. 
std::string file_name = "test.parquet"; > std::shared_ptr ostream; st = > arrow::io::FileOutputStream::Open(file_name, ); if (!st.ok()) return > 1; > std::unique_ptr writer; > std::shared_ptr props = > parquet::default_writer_properties(); st = > parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, ); if > (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if > (!st.ok()) return 1; > // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = > ostream->Close(); if (!st.ok()) return 1; > /* Read Parquet File > **/ > // Create new memory pool. Not sure if this is necessary. > //arrow::MemoryPool* pool2 = arrow::default_memory_pool(); > // Open file reader. std::shared_ptr input_file; st > = arrow::io::ReadableFile::Open(file_name, pool, _file); if (!st.ok()) > return 1; std::unique_ptr reader; st = > parquet::arrow::OpenFile(input_file, pool, ); if (!st.ok()) return 1;
[jira] [Issue Comment Deleted] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit raj boudh updated SPARK-29776: Comment: was deleted (was: i will start checking this issue.) > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. > *In case of empty pad string, the return value is null.* > Below is Example > In Spark: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); > ++ > | rpad(hi, 5, ) | > ++ > | hi | > ++ > It should return NULL as per definition. > > Hive behavior is correct as per definition it returns NULL when pad is empty > String > INFO : Concurrency mode is disabled, not creating a lock manager > +---+ > | _c0 | > +---+ > | NULL | > +---+ > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969843#comment-16969843 ] Ankit Raj Boudh commented on SPARK-29776: - i will submit the PR soon > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. > *In case of empty pad string, the return value is null.* > Below is Example > In Spark: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); > ++ > | rpad(hi, 5, ) | > ++ > | hi | > ++ > It should return NULL as per definition. > > Hive behavior is correct as per definition it returns NULL when pad is empty > String > INFO : Concurrency mode is disabled, not creating a lock manager > +---+ > | _c0 | > +---+ > | NULL | > +---+ > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29772) Add withNamespace method in test
[ https://issues.apache.org/jira/browse/SPARK-29772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29772. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26411 [https://github.com/apache/spark/pull/26411] > Add withNamespace method in test > > > Key: SPARK-29772 > URL: https://issues.apache.org/jira/browse/SPARK-29772 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > V2 catalog support namespace, we should add `withNamespace` like > `withDatabase`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
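The helper requested above creates a namespace for the duration of a test and tears it down afterwards. The actual helper is Scala in Spark's test suites; a hypothetical Python analogue of the same pattern (the `run_sql` parameter and the exact DDL are assumptions for illustration):

```python
from contextlib import contextmanager

@contextmanager
def with_namespace(run_sql, ns):
    # Hypothetical analogue of the proposed `withNamespace` test helper:
    # create the namespace on entry, always drop it on exit, so a failing
    # test body cannot leak catalog state into later tests.
    run_sql(f"CREATE NAMESPACE {ns}")
    try:
        yield ns
    finally:
        run_sql(f"DROP NAMESPACE {ns} CASCADE")
```

The try/finally is the point of the pattern: cleanup runs even when the test body raises, which is why `withDatabase`-style helpers are preferred over manual setup/teardown.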
[jira] [Assigned] (SPARK-29772) Add withNamespace method in test
[ https://issues.apache.org/jira/browse/SPARK-29772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29772: --- Assignee: ulysses you > Add withNamespace method in test > > > Key: SPARK-29772 > URL: https://issues.apache.org/jira/browse/SPARK-29772 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > V2 catalog support namespace, we should add `withNamespace` like > `withDatabase`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29672) remove python2 tests and test infra
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29672: Summary: remove python2 tests and test infra (was: remove python2 test from python/run-tests.py) > remove python2 tests and test infra > --- > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29651) Incorrect parsing of interval seconds fraction
[ https://issues.apache.org/jira/browse/SPARK-29651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29651: -- Labels: correctness (was: ) > Incorrect parsing of interval seconds fraction > -- > > Key: SPARK-29651 > URL: https://issues.apache.org/jira/browse/SPARK-29651 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Labels: correctness > Fix For: 2.4.5, 3.0.0 > > > * The fractional part of interval seconds unit is incorrectly parsed if the > number of digits is less than 9, for example: > {code} > spark-sql> select interval '10.123456 seconds'; > interval 10 seconds 123 microseconds > {code} > The result must be *interval 10 seconds 123 milliseconds 456 microseconds* > * If the seconds unit of an interval is negative, it is incorrectly converted > to `CalendarInterval`, for example: > {code} > spark-sql> select interval '-10.123456789 seconds'; > interval -9 seconds -876 milliseconds -544 microseconds > {code} > Taking into account truncation to microseconds, the result must be *interval > -10 seconds -123 milliseconds -456 microseconds* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
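Both defects above stem from treating the digits after the dot as a raw integer rather than as a decimal fraction of a second. A plain-Python sketch of the intended parsing (an illustrative model, not Spark's `IntervalUtils` code):

```python
def parse_interval_seconds(text):
    # Parses "[-]SS.FFFFFFFFF" into (seconds, microseconds), both signed.
    # The fraction is right-padded to 9 digits (nanosecond precision), so
    # "10.123456" means .123456000 s = 123456 us, not 123456 ns -- the
    # first bug reported above. Truncation to microseconds handles the
    # 9-digit case, and the sign is applied to the fraction as a whole,
    # avoiding the borrow error in the negative example.
    stripped = text.strip()
    negative = stripped.startswith('-')
    sec_str, _, frac = stripped.partition('.')
    seconds = int(sec_str)
    micros = int((frac + '000000000')[:9]) // 1000 if frac else 0
    if negative:
        micros = -micros
    return seconds, micros
```

Under this model `'10.123456'` yields 10 s 123456 us (123 ms 456 us) and `'-10.123456789'` yields -10 s -123456 us, matching the corrected results stated in the issue.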
[jira] [Created] (SPARK-29799) Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming
zengrui created SPARK-29799: --- Summary: Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming Key: SPARK-29799 URL: https://issues.apache.org/jira/browse/SPARK-29799 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3, 2.1.0 Reporter: zengrui When we use Spark Streaming to consume records from Kafka, the generated KafkaRDD has exactly as many partitions as the Kafka topic, so we cannot use more CPU cores to execute the streaming task unless we change the topic's partition count, and we cannot increase that count indefinitely. I think we can instead split a Kafka partition into multiple KafkaRDD partitions, make the split configurable, and thereby use more CPU cores for the streaming task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
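One way to realize the proposal is to divide each topic partition's offset range for a batch into N contiguous sub-ranges, each backing one RDD partition. A hypothetical sketch of that split (not part of Spark; ordering guarantees within the original Kafka partition would be lost across sub-ranges, which is the main trade-off):

```python
def split_offset_range(start, end, splits):
    # Divide the half-open Kafka offset range [start, end) into `splits`
    # contiguous sub-ranges whose sizes differ by at most one record.
    # Each sub-range could back its own KafkaRDD partition, letting more
    # cores consume a single hot topic partition in parallel.
    total = end - start
    step, remainder = divmod(total, splits)
    ranges = []
    lo = start
    for i in range(splits):
        hi = lo + step + (1 if i < remainder else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges
```

For example, splitting offsets [0, 10) three ways yields [(0, 4), (4, 7), (7, 10)]: contiguous, non-overlapping, and covering every offset exactly once.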
[jira] [Created] (SPARK-29798) Infers bytes as binary type in Python 3 at PySpark
Hyukjin Kwon created SPARK-29798: Summary: Infers bytes as binary type in Python 3 at PySpark Key: SPARK-29798 URL: https://issues.apache.org/jira/browse/SPARK-29798 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Currently, PySpark cannot infer {{bytes}} type in Python 3. This should be accepted as binary type. See https://github.com/apache/spark/pull/25749 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
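The proposed rule can be illustrated with a toy inference function. This is a standalone sketch, not PySpark's actual type-inference code; the type names are Spark SQL's, but here they are just strings:

```python
def infer_spark_sql_type(value):
    # Toy model of PySpark schema inference. The fix proposed above is the
    # bytes/bytearray branch: in Python 3 these should map to BinaryType
    # instead of failing with "can not infer type".
    if isinstance(value, bool):  # bool is a subclass of int; check it first
        return 'BooleanType'
    if isinstance(value, int):
        return 'LongType'
    if isinstance(value, float):
        return 'DoubleType'
    if isinstance(value, str):
        return 'StringType'
    if isinstance(value, (bytes, bytearray)):
        return 'BinaryType'
    raise TypeError("can not infer type for: %r" % type(value))
```

The `bool`-before-`int` ordering matters because `isinstance(True, int)` is true in Python; without it every boolean would infer as a long.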
[jira] [Resolved] (SPARK-29787) Move method add/subtract/negate from CalendarInterval to IntervalUtils
[ https://issues.apache.org/jira/browse/SPARK-29787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29787. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26423 [https://github.com/apache/spark/pull/26423] > Move method add/subtract/negate from CalendarInterval to IntervalUtils > -- > > Key: SPARK-29787 > URL: https://issues.apache.org/jira/browse/SPARK-29787 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > Move these tool methods to utils. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29787) Move method add/subtract/negate from CalendarInterval to IntervalUtils
[ https://issues.apache.org/jira/browse/SPARK-29787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29787: --- Assignee: Kent Yao > Move method add/subtract/negate from CalendarInterval to IntervalUtils > -- > > Key: SPARK-29787 > URL: https://issues.apache.org/jira/browse/SPARK-29787 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > Move these tool methods to utils. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29769: -- Description: In origin master, we can'y run sql use `EXISTS/NOT EXISTS` in Join's on condition: {code} create temporary view s1 as select * from values (1), (3), (5), (7), (9) as s1(id); create temporary view s2 as select * from values (1), (3), (4), (6), (9) as s2(id); create temporary view s3 as select * from values (3), (4), (6), (9) as s3(id); explain extended SELECT s1.id, s2.id as id2 FROM s1 LEFT OUTER JOIN s2 ON s1.id = s2.id AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6) we will get == Parsed Logical Plan == 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- 'UnresolvedRelation `s1` +- 'UnresolvedRelation `s2` == Analyzed Logical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] == Optimized Logical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter 
('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] == Physical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] Time taken: 1.455 seconds, Fetched 1 row(s) {code} Since in analyzer , it won't solve join's condition's SubQuery in *Analyzer.ResolveSubquery*, then table *s3* was unresolved. After pr https://github.com/apache/spark/pull/25854/files We will solve subqueries in join condition and it will pass analyzer level. In current master, if we run sql above, we will get {code} == Parsed Logical Plan == 'Project ['s1.id, 's2.id AS id2#291] +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation [s3] :- 'UnresolvedRelation [s1] +- 'UnresolvedRelation [s2] == Analyzed Logical Plan == id: int, id2: int Project [id#244, id#250 AS id2#291] +- Join LeftOuter, ((id#244 = id#250) AND exists#290 []) : +- Project [id#256] : +- Filter (id#256 > 6) :+- SubqueryAlias `s3` : +- Project [value#253 AS id#256] : +- LocalRelation [value#253] :- SubqueryAlias `s1` : +- Project [value#241 AS id#244] : +- LocalRelation [value#241] +- SubqueryAlias `s2` +- Project [value#247 AS id#250] +- LocalRelation [value#247] == Optimized Logical Plan == Project [id#244, id#250 AS id2#291] +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250)) : +- Project [value#253 AS id#256] : +- Filter (value#253 > 6) :+- LocalRelation [value#253] :- Project [value#241 AS id#244] : +- LocalRelation 
[value#241] +- Project [value#247 AS id#250] +- LocalRelation [value#247] == Physical Plan == *(2) Project [id#244, id#250 AS id2#291] +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 [] : +- Project [value#253 AS id#256] : +- Filter (value#253 > 6) :+- LocalRelation [value#253] :- *(2) Project [value#241 AS id#244] : +- *(2) LocalTableScan [value#241] +-
[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-21869: -- Assignee: Gabor Somogyi > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shixiong Zhu >Assignee: Gabor Somogyi >Priority: Major > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-21869. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25853 [https://github.com/apache/spark/pull/25853] > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shixiong Zhu >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
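The underlying idea of the fix — never close a cached producer while a task still holds it — is reference counting with deferred close. A hypothetical Python sketch of that pattern (not the actual Kafka connector code, which is Scala):

```python
import threading

class RefCountedCache:
    """Cache whose eviction defers close() until the last user releases.

    Models the fix discussed above: a cache timeout marks an entry for
    eviction, but the underlying resource (e.g. a Kafka producer) is only
    closed once no running task references it.
    """

    def __init__(self):
        self._lock = threading.Lock()
        # key -> [resource, refcount, marked_for_eviction]
        self._entries = {}

    def acquire(self, key, factory):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = [factory(), 0, False]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key):
        with self._lock:
            entry = self._entries[key]
            entry[1] -= 1
            if entry[2] and entry[1] == 0:
                entry[0].close()  # last user gone: safe to close now
                del self._entries[key]

    def evict(self, key):
        # Called on cache timeout. Close immediately only if idle;
        # otherwise just mark, and release() finishes the job later.
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return
            entry[2] = True
            if entry[1] == 0:
                entry[0].close()
                del self._entries[key]
```

With the pre-fix behavior (close unconditionally on timeout), a task running longer than the cache timeout would see its producer closed underneath it; here the close simply waits for the final `release`.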
[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969666#comment-16969666 ] Isaac Myers commented on SPARK-29797:
---
The "code" block is garbage. I attached the cpp file directly.

> Read key-value metadata in Parquet files written by Apache Arrow
> Key: SPARK-29797
> URL: https://issues.apache.org/jira/browse/SPARK-29797
> Project: Spark
> Issue Type: New Feature
> Components: Java API, PySpark
> Affects Versions: 2.4.4
> Environment: Apache Arrow 0.14.1 built on Windows x86.
> Reporter: Isaac Myers
> Priority: Major
> Labels: features
> Attachments: minimal_working_example.cpp
>
> Key-value (user) metadata written to a Parquet file from Apache Arrow C++ is not readable in Spark (PySpark or Java API). I can only find field-level metadata dictionaries in the schema, and no other functions in the API that indicate the presence of file-level key-value metadata. The attached code demonstrates creation and retrieval of file-level metadata using the Apache Arrow API.
> {code:java}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <cstdio>
> #include <memory>
> #include <string>
> #include <unordered_map>
> #include <vector>
> //#include
>
> int main(int argc, char* argv[])
> {
>   /* Create Parquet File */
>   arrow::Status st;
>   arrow::MemoryPool* pool = arrow::default_memory_pool();
>
>   // Create Schema and fields with metadata
>   std::vector<std::shared_ptr<arrow::Field>> fields;
>
>   std::unordered_map<std::string, std::string> a_keyval;
>   a_keyval["unit"] = "sec";
>   a_keyval["note"] = "not the standard millisecond unit";
>   arrow::KeyValueMetadata a_md(a_keyval);
>   std::shared_ptr<arrow::Field> a_field = arrow::field("a", arrow::int16(), false, a_md.Copy());
>   fields.push_back(a_field);
>
>   std::unordered_map<std::string, std::string> b_keyval;
>   b_keyval["unit"] = "ft";
>   arrow::KeyValueMetadata b_md(b_keyval);
>   std::shared_ptr<arrow::Field> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy());
>   fields.push_back(b_field);
>
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(fields);
>
>   // Add metadata to schema.
>   std::unordered_map<std::string, std::string> schema_keyval;
>   schema_keyval["classification"] = "Type 0";
>   arrow::KeyValueMetadata schema_md(schema_keyval);
>   schema = schema->AddMetadata(schema_md.Copy());
>
>   // Build arrays of data and add to Table.
>   const int64_t rowgroup_size = 100;
>   std::vector<int16_t> a_data(rowgroup_size, 0);
>   std::vector<int16_t> b_data(rowgroup_size, 0);
>   for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = rowgroup_size - i; }
>   arrow::Int16Builder a_bldr(pool);
>   arrow::Int16Builder b_bldr(pool);
>   st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>   st = b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>   st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
>   st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
>
>   std::shared_ptr<arrow::Array> a_arr_ptr;
>   std::shared_ptr<arrow::Array> b_arr_ptr;
>   arrow::ArrayVector arr_vec;
>   st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) return 1;
>   arr_vec.push_back(a_arr_ptr);
>   st = b_bldr.Finish(&b_arr_ptr); if (!st.ok()) return 1;
>   arr_vec.push_back(b_arr_ptr);
>
>   std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, arr_vec);
>
>   // Test metadata
>   printf("\nMetadata from original schema:\n");
>   printf("%s\n", schema->metadata()->ToString().c_str());
>   printf("%s\n", schema->field(0)->metadata()->ToString().c_str());
>   printf("%s\n", schema->field(1)->metadata()->ToString().c_str());
>
>   std::shared_ptr<arrow::Schema> table_schema = table->schema();
>   printf("\nMetadata from schema retrieved from table (should be the same):\n");
>   printf("%s\n", table_schema->metadata()->ToString().c_str());
>   printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str());
>   printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>
>   // Open file and write table.
>   std::string file_name = "test.parquet";
>   std::shared_ptr<arrow::io::FileOutputStream> ostream;
>   st = arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return 1;
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   std::shared_ptr<parquet::WriterProperties> props = parquet::default_writer_properties();
>   st = parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if (!st.ok()) return 1;
>   st = writer->WriteTable(*table, rowgroup_size); if (!st.ok()) return 1;
>
>   // Close file and stream.
>   st = writer->Close(); if (!st.ok()) return 1;
>   st = ostream->Close(); if (!st.ok()) return 1;
>
>   /* Read Parquet File */
>   // Create new memory pool. Not sure if this is necessary.
>   //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
>
>   // Open file reader.
>   std::shared_ptr<arrow::io::ReadableFile> input_file;
>   st = arrow::io::ReadableFile::Open(file_name, pool, &input_file); if (!st.ok()) return 1;
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   st = parquet::arrow::OpenFile(input_file, pool, &reader); if (!st.ok()) return 1;
[jira] [Updated] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isaac Myers updated SPARK-29797:
---
Attachment: minimal_working_example.cpp

> Read key-value metadata in Parquet files written by Apache Arrow
> Key: SPARK-29797
> URL: https://issues.apache.org/jira/browse/SPARK-29797
> Project: Spark
> Issue Type: New Feature
> Components: Java API, PySpark
> Affects Versions: 2.4.4
> Environment: Apache Arrow 0.14.1 built on Windows x86.
> Reporter: Isaac Myers
> Priority: Major
> Labels: features
> Attachments: minimal_working_example.cpp
>
> Key-value (user) metadata written to a Parquet file from Apache Arrow C++ is not readable in Spark (PySpark or Java API). I can only find field-level metadata dictionaries in the schema, and no other functions in the API that indicate the presence of file-level key-value metadata. The attached code demonstrates creation and retrieval of file-level metadata using the Apache Arrow API.
> {code:java}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <cstdio>
> #include <memory>
> #include <string>
> #include <unordered_map>
> #include <vector>
> //#include
>
> int main(int argc, char* argv[])
> {
>   /* Create Parquet File */
>   arrow::Status st;
>   arrow::MemoryPool* pool = arrow::default_memory_pool();
>
>   // Create Schema and fields with metadata
>   std::vector<std::shared_ptr<arrow::Field>> fields;
>
>   std::unordered_map<std::string, std::string> a_keyval;
>   a_keyval["unit"] = "sec";
>   a_keyval["note"] = "not the standard millisecond unit";
>   arrow::KeyValueMetadata a_md(a_keyval);
>   std::shared_ptr<arrow::Field> a_field = arrow::field("a", arrow::int16(), false, a_md.Copy());
>   fields.push_back(a_field);
>
>   std::unordered_map<std::string, std::string> b_keyval;
>   b_keyval["unit"] = "ft";
>   arrow::KeyValueMetadata b_md(b_keyval);
>   std::shared_ptr<arrow::Field> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy());
>   fields.push_back(b_field);
>
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(fields);
>
>   // Add metadata to schema.
>   std::unordered_map<std::string, std::string> schema_keyval;
>   schema_keyval["classification"] = "Type 0";
>   arrow::KeyValueMetadata schema_md(schema_keyval);
>   schema = schema->AddMetadata(schema_md.Copy());
>
>   // Build arrays of data and add to Table.
>   const int64_t rowgroup_size = 100;
>   std::vector<int16_t> a_data(rowgroup_size, 0);
>   std::vector<int16_t> b_data(rowgroup_size, 0);
>   for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = rowgroup_size - i; }
>   arrow::Int16Builder a_bldr(pool);
>   arrow::Int16Builder b_bldr(pool);
>   st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>   st = b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>   st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
>   st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
>
>   std::shared_ptr<arrow::Array> a_arr_ptr;
>   std::shared_ptr<arrow::Array> b_arr_ptr;
>   arrow::ArrayVector arr_vec;
>   st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) return 1;
>   arr_vec.push_back(a_arr_ptr);
>   st = b_bldr.Finish(&b_arr_ptr); if (!st.ok()) return 1;
>   arr_vec.push_back(b_arr_ptr);
>
>   std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, arr_vec);
>
>   // Test metadata
>   printf("\nMetadata from original schema:\n");
>   printf("%s\n", schema->metadata()->ToString().c_str());
>   printf("%s\n", schema->field(0)->metadata()->ToString().c_str());
>   printf("%s\n", schema->field(1)->metadata()->ToString().c_str());
>
>   std::shared_ptr<arrow::Schema> table_schema = table->schema();
>   printf("\nMetadata from schema retrieved from table (should be the same):\n");
>   printf("%s\n", table_schema->metadata()->ToString().c_str());
>   printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str());
>   printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>
>   // Open file and write table.
>   std::string file_name = "test.parquet";
>   std::shared_ptr<arrow::io::FileOutputStream> ostream;
>   st = arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return 1;
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   std::shared_ptr<parquet::WriterProperties> props = parquet::default_writer_properties();
>   st = parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if (!st.ok()) return 1;
>   st = writer->WriteTable(*table, rowgroup_size); if (!st.ok()) return 1;
>
>   // Close file and stream.
>   st = writer->Close(); if (!st.ok()) return 1;
>   st = ostream->Close(); if (!st.ok()) return 1;
>
>   /* Read Parquet File */
>   // Create new memory pool. Not sure if this is necessary.
>   //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
>
>   // Open file reader.
>   std::shared_ptr<arrow::io::ReadableFile> input_file;
>   st = arrow::io::ReadableFile::Open(file_name, pool, &input_file); if (!st.ok()) return 1;
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   st = parquet::arrow::OpenFile(input_file, pool, &reader); if (!st.ok()) return 1;
>
>   // Get schema and read metadata.
[jira] [Created] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
Isaac Myers created SPARK-29797:
---
Summary: Read key-value metadata in Parquet files written by Apache Arrow
Key: SPARK-29797
URL: https://issues.apache.org/jira/browse/SPARK-29797
Project: Spark
Issue Type: New Feature
Components: Java API, PySpark
Affects Versions: 2.4.4
Environment: Apache Arrow 0.14.1 built on Windows x86.
Reporter: Isaac Myers

Key-value (user) metadata written to a Parquet file from Apache Arrow C++ is not readable in Spark (PySpark or Java API). I can only find field-level metadata dictionaries in the schema, and no other functions in the API that indicate the presence of file-level key-value metadata. The attached code demonstrates creation and retrieval of file-level metadata using the Apache Arrow API.

{code:java}
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <cstdio>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>
//#include

int main(int argc, char* argv[])
{
  /* Create Parquet File */
  arrow::Status st;
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  // Create Schema and fields with metadata
  std::vector<std::shared_ptr<arrow::Field>> fields;

  std::unordered_map<std::string, std::string> a_keyval;
  a_keyval["unit"] = "sec";
  a_keyval["note"] = "not the standard millisecond unit";
  arrow::KeyValueMetadata a_md(a_keyval);
  std::shared_ptr<arrow::Field> a_field = arrow::field("a", arrow::int16(), false, a_md.Copy());
  fields.push_back(a_field);

  std::unordered_map<std::string, std::string> b_keyval;
  b_keyval["unit"] = "ft";
  arrow::KeyValueMetadata b_md(b_keyval);
  std::shared_ptr<arrow::Field> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy());
  fields.push_back(b_field);

  std::shared_ptr<arrow::Schema> schema = arrow::schema(fields);

  // Add metadata to schema.
  std::unordered_map<std::string, std::string> schema_keyval;
  schema_keyval["classification"] = "Type 0";
  arrow::KeyValueMetadata schema_md(schema_keyval);
  schema = schema->AddMetadata(schema_md.Copy());

  // Build arrays of data and add to Table.
  const int64_t rowgroup_size = 100;
  std::vector<int16_t> a_data(rowgroup_size, 0);
  std::vector<int16_t> b_data(rowgroup_size, 0);
  for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = rowgroup_size - i; }
  arrow::Int16Builder a_bldr(pool);
  arrow::Int16Builder b_bldr(pool);
  st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
  st = b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
  st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
  st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;

  std::shared_ptr<arrow::Array> a_arr_ptr;
  std::shared_ptr<arrow::Array> b_arr_ptr;
  arrow::ArrayVector arr_vec;
  st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) return 1;
  arr_vec.push_back(a_arr_ptr);
  st = b_bldr.Finish(&b_arr_ptr); if (!st.ok()) return 1;
  arr_vec.push_back(b_arr_ptr);

  std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, arr_vec);

  // Test metadata
  printf("\nMetadata from original schema:\n");
  printf("%s\n", schema->metadata()->ToString().c_str());
  printf("%s\n", schema->field(0)->metadata()->ToString().c_str());
  printf("%s\n", schema->field(1)->metadata()->ToString().c_str());

  std::shared_ptr<arrow::Schema> table_schema = table->schema();
  printf("\nMetadata from schema retrieved from table (should be the same):\n");
  printf("%s\n", table_schema->metadata()->ToString().c_str());
  printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str());
  printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());

  // Open file and write table.
  std::string file_name = "test.parquet";
  std::shared_ptr<arrow::io::FileOutputStream> ostream;
  st = arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return 1;
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  std::shared_ptr<parquet::WriterProperties> props = parquet::default_writer_properties();
  st = parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if (!st.ok()) return 1;
  st = writer->WriteTable(*table, rowgroup_size); if (!st.ok()) return 1;

  // Close file and stream.
  st = writer->Close(); if (!st.ok()) return 1;
  st = ostream->Close(); if (!st.ok()) return 1;

  /* Read Parquet File */
  // Create new memory pool. Not sure if this is necessary.
  //arrow::MemoryPool* pool2 = arrow::default_memory_pool();

  // Open file reader.
  std::shared_ptr<arrow::io::ReadableFile> input_file;
  st = arrow::io::ReadableFile::Open(file_name, pool, &input_file); if (!st.ok()) return 1;
  std::unique_ptr<parquet::arrow::FileReader> reader;
  st = parquet::arrow::OpenFile(input_file, pool, &reader); if (!st.ok()) return 1;

  // Get schema and read metadata.
  std::shared_ptr<arrow::Schema> new_schema;
  st = reader->GetSchema(&new_schema); if (!st.ok()) return 1;
  printf("\nMetadata from schema read from file:\n");
  printf("%s\n", new_schema->metadata()->ToString().c_str());
  // Crashes because there are no metadata.
  /*printf("%s\n", new_schema->field(0)->metadata()->ToString().c_str());
  printf("%s\n", new_schema->field(1)->metadata()->ToString().c_str());*/
  printf("field name %s metadata exists: %d\n", new_schema->field(0)->name().c_str(),
[jira] [Updated] (SPARK-29794) Column level compression
[ https://issues.apache.org/jira/browse/SPARK-29794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Vyas updated SPARK-29794:
---
Priority: Minor (was: Major)

> Column level compression
> Key: SPARK-29794
> URL: https://issues.apache.org/jira/browse/SPARK-29794
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Anirudh Vyas
> Priority: Minor
>
> Currently in Spark we do not have the capability to specify different compressions for different columns; however, this capability exists in the Parquet format, for example.
> Not sure if this has been opened before (I am sure it might have been, but I cannot find it), hence opening a lane for potential improvement.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29794) Column level compression
[ https://issues.apache.org/jira/browse/SPARK-29794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Vyas updated SPARK-29794:
---
Priority: Major (was: Minor)

> Column level compression
> Key: SPARK-29794
> URL: https://issues.apache.org/jira/browse/SPARK-29794
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Anirudh Vyas
> Priority: Major
>
> Currently in Spark we do not have the capability to specify different compressions for different columns; however, this capability exists in the Parquet format, for example.
> Not sure if this has been opened before (I am sure it might have been, but I cannot find it), hence opening a lane for potential improvement.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969615#comment-16969615 ] Shane Knapp commented on SPARK-29106:
---
Just uploaded the pip requirements.txt file that I used to get the majority of the Python tests to run. Sadly, we will not be able to test against arrow/pyarrow for the foreseeable future, as they're moving to a full conda-forge package solution rather than pip.

> Add jenkins arm test for spark
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
> Add ARM test jobs to amplab jenkins for spark.
> So far we have made two periodic ARM test jobs for spark in OpenLab: one is based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), and the other is based on a new branch which we made on 09-09, see [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] and [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]. We only have to care about the first one when integrating the ARM test with amplab jenkins.
> About the k8s test on ARM, we have tested it, see [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it later.
> And we plan to test on other stable branches too, and we can integrate them into amplab when they are ready.
> We have offered an ARM instance and sent the details to Shane Knapp; thanks to Shane for adding the first ARM job to amplab jenkins :)
> The other important thing is about leveldbjni [https://github.com/fusesource/leveldbjni/issues/80]: spark depends on leveldbjni-all-1.8 [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], and we can see there is no arm64 support. So we built an arm64-supporting release of leveldbjni, see [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], but we can't modify the spark pom.xml directly with something like 'property'/'profile' to choose the correct jar package on the ARM or x86 platform, because spark depends on some hadoop packages like hadoop-hdfs, and those packages depend on leveldbjni-all-1.8 too, unless hadoop releases a new ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 of openlabtesting and 'mvn install' it when ARM testing for spark.
> PS: The issues found and fixed:
> SPARK-28770 [https://github.com/apache/spark/pull/25673]
> SPARK-28519 [https://github.com/apache/spark/pull/25279]
> SPARK-28433 [https://github.com/apache/spark/pull/25186]
> SPARK-28467 [https://github.com/apache/spark/pull/25864]
> SPARK-29286 [https://github.com/apache/spark/pull/26021]
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29106:
---
Attachment: arm-python36.txt

> Add jenkins arm test for spark
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
> Add ARM test jobs to amplab jenkins for spark.
> So far we have made two periodic ARM test jobs for spark in OpenLab: one is based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), and the other is based on a new branch which we made on 09-09, see [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] and [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]. We only have to care about the first one when integrating the ARM test with amplab jenkins.
> About the k8s test on ARM, we have tested it, see [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it later.
> And we plan to test on other stable branches too, and we can integrate them into amplab when they are ready.
> We have offered an ARM instance and sent the details to Shane Knapp; thanks to Shane for adding the first ARM job to amplab jenkins :)
> The other important thing is about leveldbjni [https://github.com/fusesource/leveldbjni/issues/80]: spark depends on leveldbjni-all-1.8 [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], and we can see there is no arm64 support. So we built an arm64-supporting release of leveldbjni, see [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], but we can't modify the spark pom.xml directly with something like 'property'/'profile' to choose the correct jar package on the ARM or x86 platform, because spark depends on some hadoop packages like hadoop-hdfs, and those packages depend on leveldbjni-all-1.8 too, unless hadoop releases a new ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 of openlabtesting and 'mvn install' it when ARM testing for spark.
> PS: The issues found and fixed:
> SPARK-28770 [https://github.com/apache/spark/pull/25673]
> SPARK-28519 [https://github.com/apache/spark/pull/25279]
> SPARK-28433 [https://github.com/apache/spark/pull/25186]
> SPARK-28467 [https://github.com/apache/spark/pull/25864]
> SPARK-29286 [https://github.com/apache/spark/pull/26021]
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
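Until Hadoop ships an arm64-aware leveldbjni, a build has to select the native artifact by architecture itself. A hypothetical sketch of that selection logic — the group IDs are the ones named in the discussion above, but the function and mapping are invented for illustration:

```python
import platform

# leveldbjni-all 1.8 from org.fusesource ships no arm64 natives;
# the org.openlabtesting rebuild (see the Maven links above) adds them.
ARTIFACTS = {
    "x86_64": "org.fusesource.leveldbjni:leveldbjni-all:1.8",
    "amd64": "org.fusesource.leveldbjni:leveldbjni-all:1.8",
    "aarch64": "org.openlabtesting.leveldbjni:leveldbjni-all:1.8",
}

def leveldbjni_coordinate(machine=None):
    """Return the Maven coordinate carrying natives for this architecture."""
    machine = machine or platform.machine()
    try:
        return ARTIFACTS[machine]
    except KeyError:
        raise RuntimeError("no leveldbjni natives known for " + machine)
```

In the actual builds described above the same effect is achieved manually, by `mvn install`-ing the openlabtesting jar before running the ARM tests.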
[jira] [Resolved] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22340. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 24898 [https://github.com/apache/spark/pull/24898] > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Mortenson >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. > Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. 
> My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22340: Assignee: Hyukjin Kwon > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Mortenson >Assignee: Hyukjin Kwon >Priority: Major > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. > Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. > My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. 
In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
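The thread-local idea at the end of the description can be sketched in plain Python: store the job group per Python thread and re-assert it at every submission point, so the group travels with the Python thread instead of whichever JVM thread happens to serve the call. Everything here is a standalone mock — `submit_job` stands in for the (hypothetical) patched submission points, and none of this is PySpark's actual implementation:

```python
import threading

_local = threading.local()

def set_job_group(group_id, description=None):
    # Stored per Python thread, mimicking SparkContext.setJobGroup semantics.
    _local.group = (group_id, description)

def current_job_group():
    # Threads that never called set_job_group see no group at all.
    return getattr(_local, "group", None)

def submit_job(action, submitted):
    # Every patched submission path would re-send the calling thread's
    # group to the JVM before handing over the job; here we just record
    # what would be sent.
    submitted.append((current_job_group(), action))
```

With this scheme, the `run_jobs` thread in the example above would tag both of its actions with `hello` regardless of JVM thread allocation, which is the reliability the reporter asks for.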
[jira] [Assigned] (SPARK-29781) Override SBT Jackson-databind dependency like Maven
[ https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29781:
---
Assignee: Dongjoon Hyun

> Override SBT Jackson-databind dependency like Maven
> Key: SPARK-29781
> URL: https://issues.apache.org/jira/browse/SPARK-29781
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.4
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
>
> This is a `branch-2.4`-only issue.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29781) Override SBT Jackson-databind dependency like Maven
[ https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29781.
---
Fix Version/s: 2.4.5
Resolution: Fixed

Issue resolved by pull request 26417 [https://github.com/apache/spark/pull/26417]

> Override SBT Jackson-databind dependency like Maven
> Key: SPARK-29781
> URL: https://issues.apache.org/jira/browse/SPARK-29781
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.4
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 2.4.5
>
> This is a `branch-2.4`-only issue.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969494#comment-16969494 ] venkata yerubandi commented on SPARK-29335:
---
Srini, even I am having this issue now; were you able to fix it? - thanks - venkat

> Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
> Issue Type: Question
> Components: Optimizer
> Affects Versions: 2.3.0
> Environment: We tried to execute the same using spark-sql and the Thrift server via SQLWorkbench, but we are not able to use the stats.
> Reporter: Srini E
> Priority: Major
> Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
> We are trying to leverage CBO to get better plan results for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver.
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, absref;
> {code}
> Analyze completed successfully.
> Describe extended does not show column-level stats data, and queries are not leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store and not Glue.
> *When we are using spark sql and running queries we are not able to see the > stats in use in the explain plan and we are not sure if cbo is put to use.* > *A quick response would be helpful.* > *Explain Plan:* > Following Explain command does not reference to any Statistics usage. > > {code} > spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref > from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = > a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 > and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* > > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = > 2017),(ptype#4546 = A),(eid#4542 = > 29940),isnull(PID#4527),isnotnull(fpid#4523) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... > 3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(absref#4569),(absref#4569 = > Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: > string ... 
3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) > == Parsed Logical Plan == > 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] > +- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && > (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && > ('a12.eid = 29940)) && isnull('a12.PID))) > +- 'Join Inner > :- 'SubqueryAlias a12 > : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` > +- 'SubqueryAlias a13 > +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` > > == Analyzed Logical Plan == > imnem: string, fvalue: string, ptype: string, absref: string > Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] > +- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = > cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 = > 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = > cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527))) > +- Join Inner > :- SubqueryAlias a12 > : +- SubqueryAlias t_fperiods_sundar > : +- >
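For reference, a corrected spark-defaults.conf fragment for the CBO settings quoted above. Note the documented key is camelCased (`spark.sql.cbo.joinReorder.enabled`; the all-lowercase `joinreorder` spelling above may not match), and `spark.experimental.extrastrategies` is not a standard CBO setting — verify the exact spellings against your Spark version's SQL configuration docs:

```
spark.sql.cbo.enabled              true
spark.sql.cbo.joinReorder.enabled  true
```

After `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS ...`, the recorded column statistics can be inspected with `DESCRIBE EXTENDED <table> <column>`; if that output shows no min/max/distinct-count values, the stats were never persisted and CBO has nothing to work with.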
[jira] [Resolved] (SPARK-29796) HiveExternalCatalogVersionsSuite` should ignore preview release
[ https://issues.apache.org/jira/browse/SPARK-29796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29796. --- Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26428 [https://github.com/apache/spark/pull/26428] > HiveExternalCatalogVersionsSuite` should ignore preview release > --- > > Key: SPARK-29796 > URL: https://issues.apache.org/jira/browse/SPARK-29796 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.4.5, 3.0.0 > > > This issue is to exclude the `preview` release to recover > `HiveExternalCatalogVersionsSuite`. Currently, the new preview release has been breaking > `branch-2.4` PRBuilder since yesterday. A new release (especially a `preview`) > should not affect `branch-2.4`. > - https://github.com/apache/spark/pull/26417 (Failed 4 times) > {code} > scala> > scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains(""" href="spark-""")).map(""" href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) > java.util.NoSuchElementException: None.get > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29796) HiveExternalCatalogVersionsSuite` should ignore preview release
[ https://issues.apache.org/jira/browse/SPARK-29796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29796: - Assignee: Dongjoon Hyun > HiveExternalCatalogVersionsSuite` should ignore preview release > --- > > Key: SPARK-29796 > URL: https://issues.apache.org/jira/browse/SPARK-29796 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > > This issue is to exclude the `preview` release to recover > `HiveExternalCatalogVersionsSuite`. Currently, the new preview release has been breaking > `branch-2.4` PRBuilder since yesterday. A new release (especially a `preview`) > should not affect `branch-2.4`. > - https://github.com/apache/spark/pull/26417 (Failed 4 times) > {code} > scala> > scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains(""" href="spark-""")).map(""" href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) > java.util.NoSuchElementException: None.get > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29796) HiveExternalCatalogVersionsSuite` should ignore preview release
Dongjoon Hyun created SPARK-29796: - Summary: HiveExternalCatalogVersionsSuite` should ignore preview release Key: SPARK-29796 URL: https://issues.apache.org/jira/browse/SPARK-29796 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.5, 3.0.0 Reporter: Dongjoon Hyun This issue is to exclude the `preview` release to recover `HiveExternalCatalogVersionsSuite`. Currently, the new preview release has been breaking `branch-2.4` PRBuilder since yesterday. A new release (especially a `preview`) should not affect `branch-2.4`. - https://github.com/apache/spark/pull/26417 (Failed 4 times) {code} scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains(""" href="spark-""")).map(""" href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) java.util.NoSuchElementException: None.get {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
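The failure mode in the Scala snippet above can be sketched in Python (an illustrative reproduction, not the suite's actual code): the dist page now contains a `spark-3.0.0-preview/` link that the `x.y.z` pattern does not match, so the unconditional `.findFirstMatchIn(_).get` equivalent fails on the non-match.

```python
import re

# Sample links as they appear on the Apache dist page (versions illustrative)
links = [
    '<a href="spark-2.3.4/">',
    '<a href="spark-2.4.4/">',
    '<a href="spark-3.0.0-preview/">',  # the new preview release
]

pattern = re.compile(r'href="spark-(\d\.\d\.\d)/"')

# Old behaviour (equivalent of .get on every match): raises on the preview link
# versions = [pattern.search(link).group(1) for link in links]  # AttributeError

# Fixed behaviour, as the resolution describes: skip links that are not
# final releases instead of assuming every "spark-" link matches
versions = [m.group(1) for link in links for m in [pattern.search(link)] if m]
print(versions)  # ['2.3.4', '2.4.4']
```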
[jira] [Comment Edited] (SPARK-27990) Provide a way to recursively load data from datasource
[ https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969462#comment-16969462 ] Nicholas Chammas edited comment on SPARK-27990 at 11/7/19 5:54 PM: --- Are there any docs for this new {{recursiveFileLookup}} option? I can't find anything. was (Author: nchammas): Are there any docs for this new option? I can't find anything. > Provide a way to recursively load data from datasource > -- > > Key: SPARK-27990 > URL: https://issues.apache.org/jira/browse/SPARK-27990 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Provide a way to recursively load data from datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource
[ https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969462#comment-16969462 ] Nicholas Chammas commented on SPARK-27990: -- Are there any docs for this new option? I can't find anything. > Provide a way to recursively load data from datasource > -- > > Key: SPARK-27990 > URL: https://issues.apache.org/jira/browse/SPARK-27990 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Provide a way to recursively load data from datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29795) Possible 'leak' of Metrics with dropwizard metrics 4.x
Sean R. Owen created SPARK-29795: Summary: Possible 'leak' of Metrics with dropwizard metrics 4.x Key: SPARK-29795 URL: https://issues.apache.org/jira/browse/SPARK-29795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Sean R. Owen Assignee: Sean R. Owen This one's a little complex to explain. SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears to be fine, according to tests. We have not and do not intend to backport it to 2.4.x. However, I'm working with a few people trying to back-port this to Spark 2.4.x separately. When this update is applied, tests fail readily with OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A heap dump analysis shows that MetricRegistry objects are retaining a gigabyte or more of memory. It appears to be holding references to many large internal Spark objects like BlockManager and Netty objects, via closures we pass to Gauge objects. Although it looked odd, this may or may not be an issue; in normal usage where a JVM hosts one SparkContext, this may be normal. However in tests where contexts are started/restarted repeatedly, it seems like this might 'leak' old references to old context-related objects across runs via metrics. I don't have a clear theory on how yet (is SparkEnv shared or some ref held to it?), besides the empirical evidence. However, it's also not clear why this wouldn't affect Spark 3, apparently, as tests work fine. It could be another fix in Spark 3 that happens to help here; it could be that Spark 3 uses less memory and never hits the issue. Despite that uncertainty, I've found that simply clearing the registered metrics from MetricsSystem when it is stop()-ped seems to resolve the issue. At this point, Spark is shutting down and sinks have stopped, so there doesn't seem to be any harm in manually releasing all registered metrics and objects. 
I don't _think_ it's intended to track metrics across two instantiations of a SparkContext in the same JVM, but that's a question. That's the change I will propose in a PR. Why does this not happen in 2.4 + metrics 3.x? unclear. We've not seen any test failures like this in 2.4 or reports of problems with metrics-related memory pressure. It could be a change in how 4.x behaves, tracks objects, manages lifecycles. The difference does not seem to be Scala 2.11 vs 2.12, by the way. 2.4 works fine on both without the 4.x update; runs out of memory on both with the change. Why do this if this only affects 2.4 + metrics 4.x and we're not moving to metrics 4.x in 2.4? It could still be a smaller issue in Spark 3, not detected by tests. It may help apps that do for various reasons run multiple SparkContexts per JVM - like many other project test suites. It may just be good for tidiness in shutdown, to manually clear resources. Therefore I can't call this a bug per se, maybe an improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29794) Column level compression
Anirudh Vyas created SPARK-29794: Summary: Column level compression Key: SPARK-29794 URL: https://issues.apache.org/jira/browse/SPARK-29794 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: Anirudh Vyas Currently in Spark we do not have the capability to specify different compressions for different columns; however, this capability exists in the Parquet format, for example. Not sure if this has been opened before (I am sure it might have been, but I cannot find it), hence opening a lane for potential improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29793) Display plan evolve history in AQE
Wei Xue created SPARK-29793: --- Summary: Display plan evolve history in AQE Key: SPARK-29793 URL: https://issues.apache.org/jira/browse/SPARK-29793 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Wei Xue To have a way to present the entire evolve history of an AQE plan. This can be done in two stages: # Enable an option for printing the plan changes on the go. # Display history plans in Spark UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29792: --- Assignee: Ke Jia > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
Wei Xue created SPARK-29792: --- Summary: SQL metrics cannot be updated to subqueries in AQE Key: SPARK-29792 URL: https://issues.apache.org/jira/browse/SPARK-29792 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wei Xue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29791) Add a spark config to allow user to use executor cores virtually.
[ https://issues.apache.org/jira/browse/SPARK-29791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zengrui updated SPARK-29791: Description: We can configure the executor cores with "spark.executor.cores". For example, if we configure 8 cores for an executor, then the driver can only schedule 8 tasks to this executor concurrently. In fact, in most cases a task does not always occupy a core or more; much of its time is spent on disk IO or network IO. So we can make the driver schedule more than 8 tasks to this executor concurrently (virtualizing the cores to 16, 32, or more that the executor reports to the driver), which will make the whole job execute more quickly. (was: We can config the executor cores by "spark.executor.cores". For example, if we config 8 cores for a executor, then the driver can only scheduler 8 tasks to this executor concurrently. In fact, most cases a task does not always occupy a core or more. More time, tasks spent on disk IO or network IO, so we can make driver to scheduler more than 8 tasks to this executor concurrently, it will make the whole job execute more quickly.) > Add a spark config to allow user to use executor cores virtually. > - > > Key: SPARK-29791 > URL: https://issues.apache.org/jira/browse/SPARK-29791 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: zengrui >Priority: Minor > > We can configure the executor cores with "spark.executor.cores". For example, if > we configure 8 cores for an executor, then the driver can only schedule 8 tasks > to this executor concurrently. In fact, in most cases a task does not always > occupy a core or more; much of its time is spent on disk IO or network IO. So we > can make the driver schedule more than 8 tasks to this executor concurrently > (virtualizing the cores to 16, 32, or more that the executor reports to the driver), > which will make the whole job execute more quickly. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29791) Add a spark config to allow user to use executor cores virtually.
zengrui created SPARK-29791: --- Summary: Add a spark config to allow user to use executor cores virtually. Key: SPARK-29791 URL: https://issues.apache.org/jira/browse/SPARK-29791 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: zengrui We can configure the executor cores with "spark.executor.cores". For example, if we configure 8 cores for an executor, then the driver can only schedule 8 tasks to this executor concurrently. In fact, in most cases a task does not always occupy a core or more; much of its time is spent on disk IO or network IO. So we can make the driver schedule more than 8 tasks to this executor concurrently, which will make the whole job execute more quickly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
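The proposal's arithmetic can be sketched as follows (Python for illustration; the slot formula assumes the standard behavior where the driver schedules floor(cores / spark.task.cpus) concurrent tasks per executor):

```python
def concurrent_task_slots(reported_cores: int, task_cpus: int = 1) -> int:
    # The driver's view: one slot per spark.task.cpus cores reported
    return reported_cores // task_cpus

# Today: spark.executor.cores=8 means the executor reports 8 cores,
# so at most 8 concurrent tasks.
assert concurrent_task_slots(8) == 8

# The proposal: the same physical executor reports a "virtual" 16 or 32
# cores, letting the driver oversubscribe it while IO-bound tasks sit
# waiting on disk or network.
assert concurrent_task_slots(16) == 16
assert concurrent_task_slots(32, task_cpus=2) == 16
```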
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969301#comment-16969301 ] Felix Kizhakkel Jose commented on SPARK-29764: -- [~hyukjin.kwon] Could you please help me with this? > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add java.time.LocalDateTime/java.time.LocalDate properties to > my data model, serialization to Parquet starts failing. > *My Data Model:* > {code:java} > @Data > public class Employee { > private UUID id = UUID.randomUUID(); > private String name; > private int age; > private LocalDate dob; > private LocalDateTime startDateTime; > private String phone; > private Address address; > } > {code} > *Serialization Snippet* > {code:java} > public void serialize() { > List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class); > Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder); > employeeDataset.write() > .mode(SaveMode.Append) > .parquet("/Users/felix/Downloads/spark.parquet"); > } > {code} > *Exception Stack Trace:* > java.lang.IllegalStateException: Failed to execute CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > 
org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800) > ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost > task 0.0 in stage 0.0 (TID 0,
[jira] [Commented] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969279#comment-16969279 ] Yuming Wang commented on SPARK-29784: - !image-2019-11-07-22-06-48-156.png! > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: image-2019-11-07-22-06-48-156.png > > > SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 > and 2.3.2 is returning after leading and trailing character removed. > Spark 3.0 – Not correct > jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+ > | trim(SL, SSparkSQLS) | > +---+ > | | > +--- > Spark 2.4 – Correct > jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+--+ > | trim(SSparkSQLS, SL) | > +---+--+ > | parkSQ | > +---+--+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29784: Attachment: image-2019-11-07-22-06-48-156.png > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: image-2019-11-07-22-06-48-156.png > > > SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 > and 2.3.2 is returning after leading and trailing character removed. > Spark 3.0 – Not correct > jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+ > | trim(SL, SSparkSQLS) | > +---+ > | | > +--- > Spark 2.4 – Correct > jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+--+ > | trim(SSparkSQLS, SL) | > +---+--+ > | parkSQ | > +---+--+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969272#comment-16969272 ] Yuming Wang commented on SPARK-29784: - Thank you [~pavithraramachandran]. We changed our non-standard syntax of the trim function in https://github.com/apache/spark/pull/24902 from {{TRIM(trimStr, str)}} to {{TRIM(str, trimStr)}} to be compatible with other databases. > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas in 2.4 > and 2.3.2 it returns the string with the leading and trailing characters removed. > Spark 3.0 – Not correct > jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+ > | trim(SL, SSparkSQLS) | > +---+ > | | > +--- > Spark 2.4 – Correct > jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+--+ > | trim(SSparkSQLS, SL) | > +---+--+ > | parkSQ | > +---+--+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
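The effect of swapping the argument order can be sketched in Python, since `str.strip` removes any of the given characters from both ends, matching SQL TRIM's trim-character-set semantics (an illustrative model, not Spark's implementation):

```python
def trim(source: str, trim_chars: str) -> str:
    # str.strip treats its argument as a set of characters to remove
    # from both ends, like SQL TRIM(source, trim_chars)
    return source.strip(trim_chars)

# Spark 2.4 syntax TRIM(trimStr, str): 'S' and 'L' trimmed off 'SSparkSQLS'
assert trim('SSparkSQLS', 'SL') == 'parkSQ'

# Spark 3.0 syntax TRIM(str, trimStr): the same literal call now trims the
# characters of 'SSparkSQLS' off 'SL', leaving an empty string — hence the
# surprising empty result above
assert trim('SL', 'SSparkSQLS') == ''
```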
[jira] [Created] (SPARK-29790) Add notes about port being required for Kubernetes API URL when set as master
Emil Sandstø created SPARK-29790: Summary: Add notes about port being required for Kubernetes API URL when set as master Key: SPARK-29790 URL: https://issues.apache.org/jira/browse/SPARK-29790 Project: Spark Issue Type: Documentation Components: Kubernetes Affects Versions: 2.4.4, 2.4.3 Reporter: Emil Sandstø Apparently, when configuring the master endpoint, the Kubernetes API url needs to include the port, even if it's 443. This should be noted in the documentation of "Running Spark on Kubernetes guide". Reported in the wild [https://medium.com/@kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d] I had the same issue myself. We might want to create an issue for fixing the implementation, if it's considered a bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
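A sketch of the workaround the documentation note would describe (cluster hostname and image name are placeholders, not from the report):

```shell
# Note the explicit :443 in the master URL. Omitting the port can fail
# even though 443 is the default HTTPS port.
spark-submit \
  --master k8s://https://api.my-cluster.example.com:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
```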
[jira] [Commented] (SPARK-25466) Documentation does not specify how to set Kafka consumer cache capacity for SS
[ https://issues.apache.org/jira/browse/SPARK-25466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969262#comment-16969262 ] Gabor Somogyi commented on SPARK-25466: --- Well, if you think there is a bug in Spark please collect the driver + executor logs, open a new bug jira and attach it. Having partial code on Stack Overflow which is half way DStreams, half way Structured Streaming and it was originally throwing exception by default doesn't help. > Documentation does not specify how to set Kafka consumer cache capacity for SS > -- > > Key: SPARK-25466 > URL: https://issues.apache.org/jira/browse/SPARK-25466 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Patrick McGloin >Priority: Minor > > When hitting this warning with SS: > 19-09-2018 12:05:27 WARN CachedKafkaConsumer:66 - KafkaConsumer cache > hitting max capacity of 64, removing consumer for > CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30) > If you Google you get to this page: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > Which is for Spark Streaming and says to use this config item to adjust the > capacity: "spark.streaming.kafka.consumer.cache.maxCapacity". > This is a bit confusing as SS uses a different config item: > "spark.sql.kafkaConsumerCache.capacity" > Perhaps the SS Kafka documentation should talk about the consumer cache > capacity? Perhaps here? > https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html > Or perhaps the warning message should reference the config item. E.g > 19-09-2018 12:05:27 WARN CachedKafkaConsumer:66 - KafkaConsumer cache > hitting max capacity of 64, removing consumer for > CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30). 
> *The cache size can be adjusted with the setting > "spark.sql.kafkaConsumerCache.capacity".* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
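To make the distinction above concrete, the two settings side by side (capacity values illustrative; only the first applies to the Structured Streaming warning quoted in the report):

```
# Structured Streaming (the key relevant to the CachedKafkaConsumer warning):
spark.sql.kafkaConsumerCache.capacity            128

# DStreams / spark-streaming-kafka-0-10 (the key the integration guide documents):
spark.streaming.kafka.consumer.cache.maxCapacity 128
```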
[jira] [Created] (SPARK-29789) should not parse the bucket column name again when creating v2 tables
Wenchen Fan created SPARK-29789: --- Summary: should not parse the bucket column name again when creating v2 tables Key: SPARK-29789 URL: https://issues.apache.org/jira/browse/SPARK-29789 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29788) Fix the typo errors in the SQL reference document merges
[ https://issues.apache.org/jira/browse/SPARK-29788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969226#comment-16969226 ] jobit mathew commented on SPARK-29788: -- I will work on this. > Fix the typo errors in the SQL reference document merges > > > Key: SPARK-29788 > URL: https://issues.apache.org/jira/browse/SPARK-29788 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Fix the typo errors in the SQL reference document merges. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29788) Fix the typo errors in the SQL reference document merges
jobit mathew created SPARK-29788: Summary: Fix the typo errors in the SQL reference document merges Key: SPARK-29788 URL: https://issues.apache.org/jira/browse/SPARK-29788 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: jobit mathew Fix the typo errors in the SQL reference document merges. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29757) Move calendar interval constants together

[ https://issues.apache.org/jira/browse/SPARK-29757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-29757.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26399
[https://github.com/apache/spark/pull/26399]

> Move calendar interval constants together
> -----------------------------------------
>
>                 Key: SPARK-29757
>                 URL: https://issues.apache.org/jira/browse/SPARK-29757
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Minor
>             Fix For: 3.0.0
>
> {code:java}
> public static final Byte DAYS_PER_MONTH = 30;
> public static final Byte MONTHS_PER_QUARTER = 3;
> public static final int MONTHS_PER_YEAR = 12;
> public static final int YEARS_PER_MILLENNIUM = 1000;
> public static final int YEARS_PER_CENTURY = 100;
> public static final int YEARS_PER_DECADE = 10;
> public static final long NANOS_PER_MICRO = 1000L;
> public static final long MILLIS_PER_SECOND = 1000L;
> public static final long MICROS_PER_MILLI = 1000L;
> public static final long SECONDS_PER_DAY = 24L * 60 * 60;
> public static final long MILLIS_PER_MINUTE = 60 * MILLIS_PER_SECOND;
> public static final long MILLIS_PER_HOUR = 60 * MILLIS_PER_MINUTE;
> public static final long MILLIS_PER_DAY = SECONDS_PER_DAY * MILLIS_PER_SECOND;
> public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLI;
> public static final long MICROS_PER_MINUTE = MILLIS_PER_MINUTE * MICROS_PER_MILLI;
> public static final long MICROS_PER_HOUR = MILLIS_PER_HOUR * MICROS_PER_MILLI;
> public static final long MICROS_PER_DAY = SECONDS_PER_DAY * MICROS_PER_SECOND;
> public static final long MICROS_PER_WEEK = MICROS_PER_DAY * 7;
> public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
> /* 365.25 days per year assumes leap year every four years */
> public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100;
> public static final long NANOS_PER_SECOND = NANOS_PER_MICRO * MICROS_PER_SECOND;
> public static final long NANOS_PER_MILLIS = NANOS_PER_MICRO * MICROS_PER_MILLI;
> {code}
> The above constants are defined across IntervalUtils, DateTimeUtils, and
> CalendarInterval; some of them are redundant, and some are cross-referenced.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29757) Move calendar interval constants together
[ https://issues.apache.org/jira/browse/SPARK-29757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-29757:
-----------------------------------
    Assignee: Kent Yao

> Move calendar interval constants together
> -----------------------------------------
>
>                 Key: SPARK-29757
>                 URL: https://issues.apache.org/jira/browse/SPARK-29757
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Minor
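The consolidation this ticket asks for could look like a single constants holder. The following is a minimal sketch under stated assumptions: the class name `DateTimeConstants` is hypothetical (not necessarily what Spark adopted), and only a representative subset of the quoted constants is shown, with every derived value computed from the base ones so that no literal is duplicated across classes:

```java
// Sketch: gather the scattered time/interval constants into one final holder
// class instead of defining them in IntervalUtils, DateTimeUtils, and
// CalendarInterval separately. Names follow the quoted issue description;
// the holder class name itself is an assumption.
public final class DateTimeConstants {
    private DateTimeConstants() {}  // static holder, not instantiable

    // Base constants: each magic number appears exactly once.
    public static final long NANOS_PER_MICRO = 1000L;
    public static final long MICROS_PER_MILLI = 1000L;
    public static final long MILLIS_PER_SECOND = 1000L;
    public static final long SECONDS_PER_DAY = 24L * 60 * 60;

    // Derived constants: computed, never restated as literals.
    public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLI;
    public static final long MICROS_PER_DAY = SECONDS_PER_DAY * MICROS_PER_SECOND;
    public static final long MICROS_PER_WEEK = MICROS_PER_DAY * 7;
    public static final long NANOS_PER_SECOND = NANOS_PER_MICRO * MICROS_PER_SECOND;

    public static void main(String[] args) {
        System.out.println(MICROS_PER_DAY);  // 86400000000
    }
}
```

Defining the derived values in terms of the base ones is what removes the redundancy the ticket complains about: a change to one base constant propagates everywhere.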
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969167#comment-16969167 ]

Ankit Raj Boudh commented on SPARK-29760:
-----------------------------------------

[~Sean R. Owen], I will raise a PR for this.

> Document VALUES statement in SQL Reference.
> -------------------------------------------
>
>                 Key: SPARK-29760
>                 URL: https://issues.apache.org/jira/browse/SPARK-29760
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, SQL
>    Affects Versions: 2.4.4
>            Reporter: jobit mathew
>            Priority: Minor
>
> spark-sql also supports *VALUES*.
> {code:java}
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three');
> 1	one
> 2	two
> 3	three
> Time taken: 0.015 seconds, Fetched 3 row(s)
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2;
> 1	one
> 2	two
> Time taken: 0.014 seconds, Fetched 2 row(s)
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2;
> 1	one
> 3	three
> 2	two
> Time taken: 0.153 seconds, Fetched 3 row(s)
> {code}
> *VALUES* can also be used along with INSERT INTO or SELECT.
> refer: https://www.postgresql.org/docs/current/sql-values.html
> So please confirm whether VALUES should also be documented.
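The INSERT INTO / SELECT usage the comment alludes to can be sketched as follows; the table name `t` is hypothetical, and the inline-table alias syntax (`AS v(id, name)`) follows Spark SQL's support for VALUES as a row source:

```sql
-- Hypothetical table `t`; shown only to illustrate VALUES as a row source.
CREATE TABLE t (id INT, name STRING);

-- VALUES feeding an INSERT.
INSERT INTO t VALUES (1, 'one'), (2, 'two');

-- VALUES as an inline relation inside a SELECT, with a column alias list.
SELECT v.id, v.name
FROM VALUES (1, 'one'), (2, 'two'), (3, 'three') AS v(id, name)
ORDER BY v.name;
```

Both forms would belong in the SQL Reference page the ticket requests, alongside the standalone `VALUES ...;` statement already shown above.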
[jira] [Created] (SPARK-29787) Move method add/subtract/negate from CalendarInterval to IntervalUtils
Kent Yao created SPARK-29787:
--------------------------------

             Summary: Move method add/subtract/negate from CalendarInterval to IntervalUtils
                 Key: SPARK-29787
                 URL: https://issues.apache.org/jira/browse/SPARK-29787
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Kent Yao

Move these tool methods out of CalendarInterval and into IntervalUtils.
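A minimal sketch of what the proposed move might look like: the arithmetic becomes static helpers on a utilities class rather than instance methods on the value class. The `(months, days, microseconds)` field layout mirrors Spark's `CalendarInterval`, but both classes here are simplified stand-ins, not Spark's actual code:

```java
// Sketch of SPARK-29787: interval arithmetic as static utility methods.
// `CalendarInterval` below is a simplified stand-in for
// org.apache.spark.unsafe.types.CalendarInterval.
public final class IntervalUtilsSketch {

    static final class CalendarInterval {
        final int months;
        final int days;
        final long microseconds;

        CalendarInterval(int months, int days, long microseconds) {
            this.months = months;
            this.days = days;
            this.microseconds = microseconds;
        }
    }

    // add/subtract/negate live on the utils class, not on the value class.
    static CalendarInterval add(CalendarInterval a, CalendarInterval b) {
        return new CalendarInterval(a.months + b.months, a.days + b.days,
                a.microseconds + b.microseconds);
    }

    static CalendarInterval subtract(CalendarInterval a, CalendarInterval b) {
        return new CalendarInterval(a.months - b.months, a.days - b.days,
                a.microseconds - b.microseconds);
    }

    static CalendarInterval negate(CalendarInterval a) {
        return new CalendarInterval(-a.months, -a.days, -a.microseconds);
    }

    public static void main(String[] args) {
        CalendarInterval sum = add(new CalendarInterval(1, 0, 0),
                                   new CalendarInterval(0, 2, 500L));
        System.out.println(sum.months + " " + sum.days + " " + sum.microseconds);
    }
}
```

Keeping the value class immutable and free of arithmetic is the design point: `CalendarInterval` stays a plain data carrier, and all operations are grouped in one utilities class.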