[jira] [Updated] (SPARK-29776) rpad returning invalid value when parameter is empty

2019-11-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29776:
-
Description: 
As per the rpad definition:
 rpad
 rpad(str, len, pad) - Returns str, right-padded with pad to a length of len
 If str is longer than len, the return value is shortened to len characters.
 *In case of empty pad string, the return value is null.*

Below is an example.

In Spark:

{code}
0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
+----------------+
| rpad(hi, 5, )  |
+----------------+
| hi             |
+----------------+
{code}

It should return NULL as per definition.

 

Hive behavior is correct: as per the definition, it returns NULL when the pad is an 
empty string.

INFO : Concurrency mode is disabled, not creating a lock manager
{code}
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+
{code}

 

 

 

  was:
As per rpad definition
 rpad
 rpad(str, len, pad) - Returns str, right-padded with pad to a length of len
 If str is longer than len, the return value is shortened to len characters.
 *In case of empty pad string, the return value is null.*

Below is an example.

In Spark:

0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
+----------------+
| rpad(hi, 5, )  |
+----------------+
| hi             |
+----------------+

It should return NULL as per definition.

 

Hive behavior is correct: as per the definition, it returns NULL when the pad is an 
empty string.

INFO : Concurrency mode is disabled, not creating a lock manager
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+

 

 

 


> rpad returning invalid value when parameter is empty
> 
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> As per rpad definition
>  rpad
>  rpad(str, len, pad) - Returns str, right-padded with pad to a length of len
>  If str is longer than len, the return value is shortened to len characters.
>  *In case of empty pad string, the return value is null.*
> Below is an example.
> In Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> +----------------+
> | rpad(hi, 5, )  |
> +----------------+
> | hi             |
> +----------------+
> {code}
> It should return NULL as per definition.
>  
> Hive behavior is correct: as per the definition, it returns NULL when the pad is an 
> empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +-------+
> |  _c0  |
> +-------+
> | NULL  |
> +-------+
> {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version

2019-11-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29784.
--
Resolution: Invalid

> Built in function trim is not compatible in 3.0 with previous version
> -
>
> Key: SPARK-29784
> URL: https://issues.apache.org/jira/browse/SPARK-29784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: image-2019-11-07-22-06-48-156.png
>
>
> SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 2.4 
> and 2.3.2 return the value after the leading and trailing characters are removed.
> Spark 3.0 – Not correct
> {code}
> jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +-----------------------+
> | trim(SL, SSparkSQLS)  |
> +-----------------------+
> |                       |
> +-----------------------+
> {code}
> Spark 2.4 – Correct
> {code}
> jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +-----------------------+--+
> | trim(SSparkSQLS, SL)  |
> +-----------------------+--+
> | parkSQ                |
> +-----------------------+--+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version

2019-11-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29784:
-
Description: 
SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 2.4 and 
2.3.2 return the value after the leading and trailing characters are removed.

Spark 3.0 – Not correct

{code}
jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
+-----------------------+
| trim(SL, SSparkSQLS)  |
+-----------------------+
|                       |
+-----------------------+
{code}

Spark 2.4 – Correct

{code}
jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
+-----------------------+--+
| trim(SSparkSQLS, SL)  |
+-----------------------+--+
| parkSQ                |
+-----------------------+--+
{code}
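
For reference, a minimal PySpark sketch of the comparison above (assumes a SparkSession named {{spark}}); judging by the column headers in the two outputs, the two-argument form appears to swap which argument is treated as the trim character set:

{code:python}
# Sketch comparing the two-argument trim behavior (assumes a SparkSession `spark`).
# Spark 2.4 shows the column as trim(SSparkSQLS, SL) and returns 'parkSQ';
# Spark 3.0 shows it as trim(SL, SSparkSQLS) and returns an empty string.
spark.sql("SELECT trim('SL', 'SSparkSQLS') AS trimmed").show()
{code}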

 

  was:
SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 2.4 and 
2.3.2 return the value after the leading and trailing characters are removed.

Spark 3.0 – Not correct

jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
+-----------------------+
| trim(SL, SSparkSQLS)  |
+-----------------------+
|                       |
+-----------------------+

Spark 2.4 – Correct

jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
+-----------------------+--+
| trim(SSparkSQLS, SL)  |
+-----------------------+--+
| parkSQ                |
+-----------------------+--+

 


> Built in function trim is not compatible in 3.0 with previous version
> -
>
> Key: SPARK-29784
> URL: https://issues.apache.org/jira/browse/SPARK-29784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: image-2019-11-07-22-06-48-156.png
>
>
> SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 2.4 
> and 2.3.2 return the value after the leading and trailing characters are removed.
> Spark 3.0 – Not correct
> {code}
> jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +-----------------------+
> | trim(SL, SSparkSQLS)  |
> +-----------------------+
> |                       |
> +-----------------------+
> {code}
> Spark 2.4 – Correct
> {code}
> jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +-----------------------+--+
> | trim(SSparkSQLS, SL)  |
> +-----------------------+--+
> | parkSQ                |
> +-----------------------+--+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow

2019-11-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969914#comment-16969914
 ] 

Hyukjin Kwon commented on SPARK-29797:
--

Not sure. How should they be read into a DataFrame?
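
For context, a minimal sketch (pyarrow, the {{spark}} session, and the {{test.parquet}} path are assumptions; the file is the one produced by the attached C++ example) of where the metadata is visible today:

{code:python}
# Sketch; pyarrow, the `spark` session, and the "test.parquet" path are assumptions.
import pyarrow.parquet as pq

# The file-level key-value metadata written by Arrow is visible through pyarrow...
print(pq.read_schema("test.parquet").metadata)

# ...while Spark only exposes per-field metadata on the DataFrame schema;
# there is no obvious API surface for the file-level key-value pairs.
df = spark.read.parquet("test.parquet")
for field in df.schema.fields:
    print(field.name, field.metadata)
{code}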

> Read key-value metadata in Parquet files written by Apache Arrow
> 
>
> Key: SPARK-29797
> URL: https://issues.apache.org/jira/browse/SPARK-29797
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, PySpark
>Affects Versions: 2.4.4
> Environment: Apache Arrow 0.14.1 built on Windows x86. 
>  
>Reporter: Isaac Myers
>Priority: Major
>  Labels: features
> Attachments: minimal_working_example.cpp
>
>
> Key-value (user) metadata written to Parquet file from Apache Arrow c++ is 
> not readable in Spark (PySpark or Java API). I can only find field-level 
> metadata dictionaries in the schema and no other functions in the API that 
> indicate the presence of file-level key-value metadata. The attached code 
> demonstrates creation and retrieval of file-level metadata using the Apache 
> Arrow API.
> {code:java}
> #include #include #include #include #include 
> #include #include 
> #include #include 
> //#include 
> int main(int argc, char* argv[]){ /* Create 
> Parquet File **/ arrow::Status st; 
> arrow::MemoryPool* pool = arrow::default_memory_pool();
>  // Create Schema and fields with metadata 
> std::vector> fields;
>  std::unordered_map a_keyval; a_keyval["unit"] = 
> "sec"; a_keyval["note"] = "not the standard millisecond unit"; 
> arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field 
> = arrow::field("a", arrow::int16(), false, a_md.Copy()); 
> fields.push_back(a_field);
>  std::unordered_map b_keyval; b_keyval["unit"] = 
> "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr 
> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); 
> fields.push_back(b_field);
>  std::shared_ptr schema = arrow::schema(fields);
>  // Add metadata to schema. std::unordered_map 
> schema_keyval; schema_keyval["classification"] = "Type 0"; 
> arrow::KeyValueMetadata schema_md(schema_keyval); schema = 
> schema->AddMetadata(schema_md.Copy());
>  // Build arrays of data and add to Table. const int64_t rowgroup_size = 100; 
> std::vector a_data(rowgroup_size, 0); std::vector 
> b_data(rowgroup_size, 0);
>  for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = 
> rowgroup_size - i; }  arrow::Int16Builder a_bldr(pool); arrow::Int16Builder 
> b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = 
> b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>  st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
>  st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
>  std::shared_ptr a_arr_ptr; std::shared_ptr 
> b_arr_ptr;
>  arrow::ArrayVector arr_vec; st = a_bldr.Finish(_arr_ptr); if (!st.ok()) 
> return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(_arr_ptr); if 
> (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr);
>  std::shared_ptr table = arrow::Table::Make(schema, arr_vec);
>  // Test metadata printf("\nMetadata from original schema:\n"); 
> printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(1)->metadata()->ToString().c_str());
>  std::shared_ptr table_schema = table->schema(); 
> printf("\nMetadata from schema retrieved from table (should be the 
> same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>  // Open file and write table. std::string file_name = "test.parquet"; 
> std::shared_ptr ostream; st = 
> arrow::io::FileOutputStream::Open(file_name, ); if (!st.ok()) return 
> 1;
>  std::unique_ptr writer; 
> std::shared_ptr props = 
> parquet::default_writer_properties(); st = 
> parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, ); if 
> (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if 
> (!st.ok()) return 1;
>  // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = 
> ostream->Close(); if (!st.ok()) return 1;
>  /* Read Parquet File 
> **/
>  // Create new memory pool. Not sure if this is necessary. 
> //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
>  // Open file reader. std::shared_ptr input_file; st 
> = arrow::io::ReadableFile::Open(file_name, pool, _file); if (!st.ok()) 
> return 1; std::unique_ptr reader; st = 
> parquet::arrow::OpenFile(input_file, pool, ); if (!st.ok()) return 1;

[jira] [Issue Comment Deleted] (SPARK-29776) rpad returning invalid value when parameter is empty

2019-11-07 Thread Ankit raj boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit raj boudh updated SPARK-29776:

Comment: was deleted

(was: i will start checking this issue.)

> rpad returning invalid value when parameter is empty
> 
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> As per rpad definition
>  rpad
>  rpad(str, len, pad) - Returns str, right-padded with pad to a length of len
>  If str is longer than len, the return value is shortened to len characters.
>  *In case of empty pad string, the return value is null.*
> Below is an example.
> In Spark:
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> +----------------+
> | rpad(hi, 5, )  |
> +----------------+
> | hi             |
> +----------------+
> It should return NULL as per definition.
>  
> Hive behavior is correct: as per the definition, it returns NULL when the pad is an 
> empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> +-------+
> |  _c0  |
> +-------+
> | NULL  |
> +-------+
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty

2019-11-07 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969843#comment-16969843
 ] 

Ankit Raj Boudh commented on SPARK-29776:
-

I will submit the PR soon.

> rpad returning invalid value when parameter is empty
> 
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> As per rpad definition
>  rpad
>  rpad(str, len, pad) - Returns str, right-padded with pad to a length of len
>  If str is longer than len, the return value is shortened to len characters.
>  *In case of empty pad string, the return value is null.*
> Below is an example.
> In Spark:
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> +----------------+
> | rpad(hi, 5, )  |
> +----------------+
> | hi             |
> +----------------+
> It should return NULL as per definition.
>  
> Hive behavior is correct: as per the definition, it returns NULL when the pad is an 
> empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> +-------+
> |  _c0  |
> +-------+
> | NULL  |
> +-------+
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29772) Add withNamespace method in test

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29772.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26411
[https://github.com/apache/spark/pull/26411]

> Add withNamespace method in test
> 
>
> Key: SPARK-29772
> URL: https://issues.apache.org/jira/browse/SPARK-29772
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.0.0
>
>
> The V2 catalog supports namespaces; we should add a `withNamespace` test helper like 
> `withDatabase`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29772) Add withNamespace method in test

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29772:
---

Assignee: ulysses you

> Add withNamespace method in test
> 
>
> Key: SPARK-29772
> URL: https://issues.apache.org/jira/browse/SPARK-29772
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> The V2 catalog supports namespaces; we should add a `withNamespace` test helper like 
> `withDatabase`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29672) remove python2 tests and test infra

2019-11-07 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp updated SPARK-29672:

Summary: remove python2 tests and test infra  (was: remove python2 test 
from python/run-tests.py)

> remove python2 tests and test infra
> ---
>
> Key: SPARK-29672
> URL: https://issues.apache.org/jira/browse/SPARK-29672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29651) Incorrect parsing of interval seconds fraction

2019-11-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29651:
--
Labels: correctness  (was: )

> Incorrect parsing of interval seconds fraction
> --
>
> Key: SPARK-29651
> URL: https://issues.apache.org/jira/browse/SPARK-29651
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> * The fractional part of interval seconds unit is incorrectly parsed if the 
> number of digits is less than 9, for example:
> {code}
> spark-sql> select interval '10.123456 seconds';
> interval 10 seconds 123 microseconds
> {code}
> The result must be *interval 10 seconds 123 milliseconds 456 microseconds*
> * If the seconds unit of an interval is negative, it is incorrectly converted 
> to `CalendarInterval`, for example:
> {code}
> spark-sql> select interval '-10.123456789 seconds';
> interval -9 seconds -876 milliseconds -544 microseconds
> {code}
> Taking into account truncation to microseconds, the result must be *interval 
> -10 seconds -123 milliseconds -456 microseconds*
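
For reference, a plain-Python sketch (not Spark code) of the expected decomposition, truncating to microsecond precision:

{code:python}
# Plain-Python sketch of the expected seconds-fraction decomposition (not Spark code).
from decimal import Decimal

def split_seconds(text):
    micros = int(Decimal(text) * 1_000_000)      # truncate to microsecond precision
    sign = -1 if micros < 0 else 1
    secs, rest = divmod(abs(micros), 1_000_000)
    millis, micro = divmod(rest, 1_000)
    return sign * secs, sign * millis, sign * micro

print(split_seconds("10.123456"))      # (10, 123, 456)
print(split_seconds("-10.123456789"))  # (-10, -123, -456)
{code}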



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29799) Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming

2019-11-07 Thread zengrui (Jira)
zengrui created SPARK-29799:
---

 Summary: Split a kafka partition into multiple KafkaRDD partitions 
in the kafka external plugin for Spark Streaming
 Key: SPARK-29799
 URL: https://issues.apache.org/jira/browse/SPARK-29799
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.3, 2.1.0
Reporter: zengrui


When we use Spark Streaming to consume records from Kafka, the generated 
KafkaRDD's partition count equals the Kafka topic's partition count, so we cannot 
use more CPU cores to execute the streaming task unless we change the topic's 
partition count, and we cannot increase the topic's partition count indefinitely.

I think we can split a Kafka partition into multiple KafkaRDD partitions and make 
this configurable, so that more CPU cores can be used to execute the streaming task.
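
For reference, the Structured Streaming Kafka source already exposes a {{minPartitions}} option that divides Kafka partitions across more Spark input partitions; the DStream-based KafkaRDD path discussed above has no equivalent knob. A sketch (broker and topic names are hypothetical):

{code:python}
# Sketch using the Structured Streaming Kafka source; broker/topic names are hypothetical.
# `minPartitions` asks Spark for more input partitions than the topic has Kafka
# partitions, which covers the "more cores than topic partitions" case for that API.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("minPartitions", "64")
      .load())
{code}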



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29798) Infers bytes as binary type in Python 3 at PySpark

2019-11-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-29798:


 Summary: Infers bytes as binary type in Python 3 at PySpark
 Key: SPARK-29798
 URL: https://issues.apache.org/jira/browse/SPARK-29798
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Currently, PySpark cannot infer the {{bytes}} type in Python 3. It should be 
accepted as binary type. See https://github.com/apache/spark/pull/25749
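
A minimal reproduction sketch (assumes a SparkSession {{spark}}):

{code:python}
# Reproduction sketch; assumes a SparkSession `spark`.
from pyspark.sql.types import BinaryType, StructField, StructType

# Schema inference from Python 3 `bytes` fails in the affected versions:
# spark.createDataFrame([(b"raw payload",)], ["data"])   # type cannot be inferred

# An explicit BinaryType schema works, which is the behavior inference should match.
schema = StructType([StructField("data", BinaryType())])
spark.createDataFrame([(b"raw payload",)], schema).show()
{code}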



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29787) Move method add/subtract/negate from CalendarInterval to IntervalUtils

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29787.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26423
[https://github.com/apache/spark/pull/26423]

> Move method add/subtract/negate from CalendarInterval to IntervalUtils
> --
>
> Key: SPARK-29787
> URL: https://issues.apache.org/jira/browse/SPARK-29787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Move these tool methods to utils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29787) Move method add/subtract/negate from CalendarInterval to IntervalUtils

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29787:
---

Assignee: Kent Yao

> Move method add/subtract/negate from CalendarInterval to IntervalUtils
> --
>
> Key: SPARK-29787
> URL: https://issues.apache.org/jira/browse/SPARK-29787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> Move these tool methods to utils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"

2019-11-07 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29769:
--
Description: 
 In the original master, we can't run SQL using `EXISTS/NOT EXISTS` in a join's ON 
condition:
{code}
create temporary view s1 as select * from values
(1), (3), (5), (7), (9)
  as s1(id);

create temporary view s2 as select * from values
(1), (3), (4), (6), (9)
  as s2(id);

create temporary view s3 as select * from values
(3), (4), (6), (9)
  as s3(id);

 explain extended SELECT s1.id, s2.id as id2 FROM s1
 LEFT OUTER JOIN s2 ON s1.id = s2.id
 AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
 
{code}

we will get:

{code}
== Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- 'UnresolvedRelation `s1`
   +- 'UnresolvedRelation `s2`

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]

org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]

== Physical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]
Time taken: 1.455 seconds, Fetched 1 row(s)
{code}

Since the analyzer does not resolve subqueries in a join's condition in 
*Analyzer.ResolveSubquery*, the table *s3* was unresolved.

After PR https://github.com/apache/spark/pull/25854/files,

we resolve subqueries in join conditions and the query passes the analyzer stage.

In the current master, if we run the SQL above, we will get:
{code}
 == Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#291]
+- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation [s3]
   :- 'UnresolvedRelation [s1]
   +- 'UnresolvedRelation [s2]

== Analyzed Logical Plan ==
id: int, id2: int
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
   :  +- Project [id#256]
   : +- Filter (id#256 > 6)
   :+- SubqueryAlias `s3`
   :   +- Project [value#253 AS id#256]
   :  +- LocalRelation [value#253]
   :- SubqueryAlias `s1`
   :  +- Project [value#241 AS id#244]
   : +- LocalRelation [value#241]
   +- SubqueryAlias `s2`
  +- Project [value#247 AS id#250]
 +- LocalRelation [value#247]

== Optimized Logical Plan ==
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
   :  +- Project [value#253 AS id#256]
   : +- Filter (value#253 > 6)
   :+- LocalRelation [value#253]
   :- Project [value#241 AS id#244]
   :  +- LocalRelation [value#241]
   +- Project [value#247 AS id#250]
  +- LocalRelation [value#247]

== Physical Plan ==
*(2) Project [id#244, id#250 AS id2#291]
+- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 
[]
   :  +- Project [value#253 AS id#256]
   : +- Filter (value#253 > 6)
   :+- LocalRelation [value#253]
   :- *(2) Project [value#241 AS id#244]
   :  +- *(2) LocalTableScan [value#241]
   +- 

[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"

2019-11-07 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29769:
--
Description: 
 In the original master, we can't run SQL using `EXISTS/NOT EXISTS` in a join's ON 
condition:
{code}
create temporary view s1 as select * from values
(1), (3), (5), (7), (9)
  as s1(id);

create temporary view s2 as select * from values
(1), (3), (4), (6), (9)
  as s2(id);

create temporary view s3 as select * from values
(3), (4), (6), (9)
  as s3(id);

 explain extended SELECT s1.id, s2.id as id2 FROM s1
 LEFT OUTER JOIN s2 ON s1.id = s2.id
 AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
 
{code}

we will get:

{code}
== Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- 'UnresolvedRelation `s1`
   +- 'UnresolvedRelation `s2`

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]

org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]

== Physical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 
pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   : +- SubqueryAlias `s1`
   :+- LocalRelation [id#0]
   +- SubqueryAlias `s2`
  +- Project [id#1]
 +- SubqueryAlias `s2`
+- LocalRelation [id#1]
Time taken: 1.455 seconds, Fetched 1 row(s)
{code}

Since the analyzer does not resolve subqueries in a join's condition in 
*Analyzer.ResolveSubquery*, the table *s3* was unresolved.

After PR https://github.com/apache/spark/pull/25854/files,

we resolve subqueries in join conditions and the query passes the analyzer stage.

In the current master, if we run the SQL above, we will get:
{code}
 == Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#291]
+- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
   :  +- 'Project [*]
   : +- 'Filter ('s3.id > 6)
   :+- 'UnresolvedRelation [s3]
   :- 'UnresolvedRelation [s1]
   +- 'UnresolvedRelation [s2]

== Analyzed Logical Plan ==
id: int, id2: int
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
   :  +- Project [id#256]
   : +- Filter (id#256 > 6)
   :+- SubqueryAlias `s3`
   :   +- Project [value#253 AS id#256]
   :  +- LocalRelation [value#253]
   :- SubqueryAlias `s1`
   :  +- Project [value#241 AS id#244]
   : +- LocalRelation [value#241]
   +- SubqueryAlias `s2`
  +- Project [value#247 AS id#250]
 +- LocalRelation [value#247]

== Optimized Logical Plan ==
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
   :  +- Project [value#253 AS id#256]
   : +- Filter (value#253 > 6)
   :+- LocalRelation [value#253]
   :- Project [value#241 AS id#244]
   :  +- LocalRelation [value#241]
   +- Project [value#247 AS id#250]
  +- LocalRelation [value#247]

== Physical Plan ==
*(2) Project [id#244, id#250 AS id2#291]
+- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 
[]
   :  +- Project [value#253 AS id#256]
   : +- Filter (value#253 > 6)
   :+- LocalRelation [value#253]
   :- *(2) Project [value#241 AS id#244]
   :  +- *(2) LocalTableScan [value#241]
   +- 

[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-11-07 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-21869:
--

Assignee: Gabor Somogyi

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-11-07 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-21869.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25853
[https://github.com/apache/spark/pull/25853]

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow

2019-11-07 Thread Isaac Myers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969666#comment-16969666
 ] 

Isaac Myers commented on SPARK-29797:
-

The "code" block is garbage. I attached the cpp file directly. 

> Read key-value metadata in Parquet files written by Apache Arrow
> 
>
> Key: SPARK-29797
> URL: https://issues.apache.org/jira/browse/SPARK-29797
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, PySpark
>Affects Versions: 2.4.4
> Environment: Apache Arrow 0.14.1 built on Windows x86. 
>  
>Reporter: Isaac Myers
>Priority: Major
>  Labels: features
> Attachments: minimal_working_example.cpp
>
>
> Key-value (user) metadata written to Parquet file from Apache Arrow c++ is 
> not readable in Spark (PySpark or Java API). I can only find field-level 
> metadata dictionaries in the schema and no other functions in the API that 
> indicate the presence of file-level key-value metadata. The attached code 
> demonstrates creation and retrieval of file-level metadata using the Apache 
> Arrow API.
> {code:java}
> #include #include #include #include #include 
> #include #include 
> #include #include 
> //#include 
> int main(int argc, char* argv[]){ /* Create 
> Parquet File **/ arrow::Status st; 
> arrow::MemoryPool* pool = arrow::default_memory_pool();
>  // Create Schema and fields with metadata 
> std::vector> fields;
>  std::unordered_map a_keyval; a_keyval["unit"] = 
> "sec"; a_keyval["note"] = "not the standard millisecond unit"; 
> arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field 
> = arrow::field("a", arrow::int16(), false, a_md.Copy()); 
> fields.push_back(a_field);
>  std::unordered_map b_keyval; b_keyval["unit"] = 
> "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr 
> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); 
> fields.push_back(b_field);
>  std::shared_ptr schema = arrow::schema(fields);
>  // Add metadata to schema. std::unordered_map 
> schema_keyval; schema_keyval["classification"] = "Type 0"; 
> arrow::KeyValueMetadata schema_md(schema_keyval); schema = 
> schema->AddMetadata(schema_md.Copy());
>  // Build arrays of data and add to Table. const int64_t rowgroup_size = 100; 
> std::vector a_data(rowgroup_size, 0); std::vector 
> b_data(rowgroup_size, 0);
>  for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = 
> rowgroup_size - i; }  arrow::Int16Builder a_bldr(pool); arrow::Int16Builder 
> b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = 
> b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>  st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
>  st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
>  std::shared_ptr a_arr_ptr; std::shared_ptr 
> b_arr_ptr;
>  arrow::ArrayVector arr_vec; st = a_bldr.Finish(_arr_ptr); if (!st.ok()) 
> return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(_arr_ptr); if 
> (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr);
>  std::shared_ptr table = arrow::Table::Make(schema, arr_vec);
>  // Test metadata printf("\nMetadata from original schema:\n"); 
> printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(1)->metadata()->ToString().c_str());
>  std::shared_ptr table_schema = table->schema(); 
> printf("\nMetadata from schema retrieved from table (should be the 
> same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>  // Open file and write table. std::string file_name = "test.parquet"; 
> std::shared_ptr ostream; st = 
> arrow::io::FileOutputStream::Open(file_name, ); if (!st.ok()) return 
> 1;
>  std::unique_ptr writer; 
> std::shared_ptr props = 
> parquet::default_writer_properties(); st = 
> parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, ); if 
> (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if 
> (!st.ok()) return 1;
>  // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = 
> ostream->Close(); if (!st.ok()) return 1;
>  /* Read Parquet File 
> **/
>  // Create new memory pool. Not sure if this is necessary. 
> //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
>  // Open file reader. std::shared_ptr input_file; st 
> = arrow::io::ReadableFile::Open(file_name, pool, _file); if (!st.ok()) 
> return 1; std::unique_ptr reader; st = 
> parquet::arrow::OpenFile(input_file, pool, ); if 

[jira] [Updated] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow

2019-11-07 Thread Isaac Myers (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isaac Myers updated SPARK-29797:

Attachment: minimal_working_example.cpp

> Read key-value metadata in Parquet files written by Apache Arrow
> 
>
> Key: SPARK-29797
> URL: https://issues.apache.org/jira/browse/SPARK-29797
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, PySpark
>Affects Versions: 2.4.4
> Environment: Apache Arrow 0.14.1 built on Windows x86. 
>  
>Reporter: Isaac Myers
>Priority: Major
>  Labels: features
> Attachments: minimal_working_example.cpp
>
>
> Key-value (user) metadata written to Parquet file from Apache Arrow c++ is 
> not readable in Spark (PySpark or Java API). I can only find field-level 
> metadata dictionaries in the schema and no other functions in the API that 
> indicate the presence of file-level key-value metadata. The attached code 
> demonstrates creation and retrieval of file-level metadata using the Apache 
> Arrow API.
> {code:java}
> #include #include #include #include #include 
> #include #include 
> #include #include 
> //#include 
> int main(int argc, char* argv[]){ /* Create 
> Parquet File **/ arrow::Status st; 
> arrow::MemoryPool* pool = arrow::default_memory_pool();
>  // Create Schema and fields with metadata 
> std::vector> fields;
>  std::unordered_map a_keyval; a_keyval["unit"] = 
> "sec"; a_keyval["note"] = "not the standard millisecond unit"; 
> arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field 
> = arrow::field("a", arrow::int16(), false, a_md.Copy()); 
> fields.push_back(a_field);
>  std::unordered_map b_keyval; b_keyval["unit"] = 
> "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr 
> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); 
> fields.push_back(b_field);
>  std::shared_ptr schema = arrow::schema(fields);
>  // Add metadata to schema. std::unordered_map 
> schema_keyval; schema_keyval["classification"] = "Type 0"; 
> arrow::KeyValueMetadata schema_md(schema_keyval); schema = 
> schema->AddMetadata(schema_md.Copy());
>  // Build arrays of data and add to Table. const int64_t rowgroup_size = 100; 
> std::vector a_data(rowgroup_size, 0); std::vector 
> b_data(rowgroup_size, 0);
>  for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = 
> rowgroup_size - i; }  arrow::Int16Builder a_bldr(pool); arrow::Int16Builder 
> b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = 
> b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>  st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
>  st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
>  std::shared_ptr a_arr_ptr; std::shared_ptr 
> b_arr_ptr;
>  arrow::ArrayVector arr_vec; st = a_bldr.Finish(_arr_ptr); if (!st.ok()) 
> return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(_arr_ptr); if 
> (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr);
>  std::shared_ptr table = arrow::Table::Make(schema, arr_vec);
>  // Test metadata printf("\nMetadata from original schema:\n"); 
> printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(1)->metadata()->ToString().c_str());
>  std::shared_ptr table_schema = table->schema(); 
> printf("\nMetadata from schema retrieved from table (should be the 
> same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>  // Open file and write table. std::string file_name = "test.parquet"; 
> std::shared_ptr ostream; st = 
> arrow::io::FileOutputStream::Open(file_name, ); if (!st.ok()) return 
> 1;
>  std::unique_ptr writer; 
> std::shared_ptr props = 
> parquet::default_writer_properties(); st = 
> parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, ); if 
> (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if 
> (!st.ok()) return 1;
>  // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = 
> ostream->Close(); if (!st.ok()) return 1;
>  /* Read Parquet File 
> **/
>  // Create new memory pool. Not sure if this is necessary. 
> //arrow::MemoryPool* pool2 = arrow::default_memory_pool();
>  // Open file reader. std::shared_ptr input_file; st 
> = arrow::io::ReadableFile::Open(file_name, pool, _file); if (!st.ok()) 
> return 1; std::unique_ptr reader; st = 
> parquet::arrow::OpenFile(input_file, pool, ); if (!st.ok()) return 1;
>  // Get schema and read metadata. 

[jira] [Created] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow

2019-11-07 Thread Isaac Myers (Jira)
Isaac Myers created SPARK-29797:
---

 Summary: Read key-value metadata in Parquet files written by 
Apache Arrow
 Key: SPARK-29797
 URL: https://issues.apache.org/jira/browse/SPARK-29797
 Project: Spark
  Issue Type: New Feature
  Components: Java API, PySpark
Affects Versions: 2.4.4
 Environment: Apache Arrow 0.14.1 built on Windows x86. 

 
Reporter: Isaac Myers


Key-value (user) metadata written to Parquet file from Apache Arrow c++ is not 
readable in Spark (PySpark or Java API). I can only find field-level metadata 
dictionaries in the schema and no other functions in the API that indicate the 
presence of file-level key-value metadata. The attached code demonstrates 
creation and retrieval of file-level metadata using the Apache Arrow API.
{code:java}
#include #include #include #include #include 
#include #include #include 
#include //#include 
int main(int argc, char* argv[]){ /* Create 
Parquet File **/ arrow::Status st; 
arrow::MemoryPool* pool = arrow::default_memory_pool();
 // Create Schema and fields with metadata 
std::vector> fields;
 std::unordered_map a_keyval; a_keyval["unit"] = 
"sec"; a_keyval["note"] = "not the standard millisecond unit"; 
arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field = 
arrow::field("a", arrow::int16(), false, a_md.Copy()); 
fields.push_back(a_field);
 std::unordered_map b_keyval; b_keyval["unit"] = 
"ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr 
b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); 
fields.push_back(b_field);
 std::shared_ptr schema = arrow::schema(fields);
 // Add metadata to schema. std::unordered_map 
schema_keyval; schema_keyval["classification"] = "Type 0"; 
arrow::KeyValueMetadata schema_md(schema_keyval); schema = 
schema->AddMetadata(schema_md.Copy());
 // Build arrays of data and add to Table. const int64_t rowgroup_size = 100; 
std::vector a_data(rowgroup_size, 0); std::vector 
b_data(rowgroup_size, 0);
 for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = 
rowgroup_size - i; }  arrow::Int16Builder a_bldr(pool); arrow::Int16Builder 
b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = 
b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
 st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
 st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
 std::shared_ptr a_arr_ptr; std::shared_ptr 
b_arr_ptr;
 arrow::ArrayVector arr_vec; st = a_bldr.Finish(_arr_ptr); if (!st.ok()) 
return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(_arr_ptr); if 
(!st.ok()) return 1; arr_vec.push_back(b_arr_ptr);
 std::shared_ptr table = arrow::Table::Make(schema, arr_vec);
 // Test metadata printf("\nMetadata from original schema:\n"); printf("%s\n", 
schema->metadata()->ToString().c_str()); printf("%s\n", 
schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
schema->field(1)->metadata()->ToString().c_str());
 std::shared_ptr table_schema = table->schema(); 
printf("\nMetadata from schema retrieved from table (should be the same):\n"); 
printf("%s\n", table_schema->metadata()->ToString().c_str()); printf("%s\n", 
table_schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
table_schema->field(1)->metadata()->ToString().c_str());
 // Open file and write table. std::string file_name = "test.parquet"; 
std::shared_ptr ostream; st = 
arrow::io::FileOutputStream::Open(file_name, ); if (!st.ok()) return 1;
 std::unique_ptr writer; 
std::shared_ptr props = 
parquet::default_writer_properties(); st = 
parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, ); if 
(!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if 
(!st.ok()) return 1;
 // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = 
ostream->Close(); if (!st.ok()) return 1;
 /* Read Parquet File 
**/
 // Create new memory pool. Not sure if this is necessary. //arrow::MemoryPool* 
pool2 = arrow::default_memory_pool();
 // Open file reader. std::shared_ptr input_file; st = 
arrow::io::ReadableFile::Open(file_name, pool, _file); if (!st.ok()) 
return 1; std::unique_ptr reader; st = 
parquet::arrow::OpenFile(input_file, pool, ); if (!st.ok()) return 1;
 // Get schema and read metadata. std::shared_ptr new_schema; st 
= reader->GetSchema(_schema); if (!st.ok()) return 1; printf("\nMetadata 
from schema read from file:\n"); printf("%s\n", 
new_schema->metadata()->ToString().c_str());
 // Crashes because there are no metadata. /*printf("%s\n", 
new_schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
new_schema->field(1)->metadata()->ToString().c_str());*/
 printf("field name %s metadata exists: %d\n", 
new_schema->field(0)->name().c_str(), 

[jira] [Updated] (SPARK-29794) Column level compression

2019-11-07 Thread Anirudh Vyas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Vyas updated SPARK-29794:
-
Priority: Minor  (was: Major)

> Column level compression
> 
>
> Key: SPARK-29794
> URL: https://issues.apache.org/jira/browse/SPARK-29794
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Anirudh Vyas
>Priority: Minor
>
> Currently Spark has no capability to specify different compression codecs for 
> different columns, although this capability exists in the Parquet format, for 
> example.
>  
> Not sure if this has been opened before (I am sure it might have been, but I 
> cannot find it), hence opening a lane for potential improvement.
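
For reference, the Parquet capability mentioned above can be exercised through pyarrow; a sketch with hypothetical column names (this is pyarrow, not a Spark API):

{code:python}
# pyarrow sketch (not a Spark API); column names and file name are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "payload": ["a" * 100, "b" * 100, "c" * 100]})

# Parquet allows a different codec per column, which is the capability
# the issue asks Spark to expose.
pq.write_table(
    table,
    "mixed_compression.parquet",
    compression={"id": "snappy", "payload": "gzip"},
)
{code}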



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29794) Column level compression

2019-11-07 Thread Anirudh Vyas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Vyas updated SPARK-29794:
-
Priority: Major  (was: Minor)

> Column level compression
> 
>
> Key: SPARK-29794
> URL: https://issues.apache.org/jira/browse/SPARK-29794
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Anirudh Vyas
>Priority: Major
>
> Currently Spark has no capability to specify different compression codecs for 
> different columns, although this capability exists in the Parquet format, for 
> example.
>  
> Not sure if this has been opened before (I am sure it might have been, but I 
> cannot find it), hence opening a lane for potential improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-11-07 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969615#comment-16969615
 ] 

Shane Knapp commented on SPARK-29106:
-

Just uploaded the pip requirements.txt file that I used to get the majority of 
the Python tests running.

Sadly, we will not be able to test against Arrow/PyArrow for the foreseeable 
future, as they're moving to a full conda-forge package solution rather than pip.

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
>
> Add arm test jobs to amplab jenkins for spark.
> Till now we made two arm test periodic jobs for spark in OpenLab, one is 
> based on master with hadoop 2.7(similar with QA test of amplab jenkins), 
> other one is based on a new branch which we made on date 09-09, see  
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>   and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the arm test with amplab 
> jenkins.
> About the k8s test on arm, we have tested it, see 
> [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it 
> later. 
> And we plan test on other stable branches too, and we can integrate them to 
> amplab when they are ready.
> We have offered an arm instance and sent the infos to shane knapp, thanks 
> shane to add the first arm job to amplab jenkins :) 
> The other important thing is about the leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  we can see there is no arm64 supporting. So we build an arm64 supporting 
> release of leveldbjni see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the spark pom.xml directly with something like 
> 'property'/'profile' to choose the correct jar package on arm or x86 platforms, 
> because spark depends on some hadoop packages like hadoop-hdfs, and those packages 
> depend on leveldbjni-all-1.8 too, unless hadoop releases a new arm-supporting 
> leveldbjni jar. Now we download the leveldbjni-all-1.8 of 
> openlabtesting and 'mvn install' it to use when testing spark on arm.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29106) Add jenkins arm test for spark

2019-11-07 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp updated SPARK-29106:

Attachment: arm-python36.txt

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
>
> Add arm test jobs to amplab jenkins for spark.
> Till now we made two arm test periodic jobs for spark in OpenLab, one is 
> based on master with hadoop 2.7(similar with QA test of amplab jenkins), 
> other one is based on a new branch which we made on date 09-09, see  
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>   and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the arm test with amplab 
> jenkins.
> About the k8s test on arm, we have tested it, see 
> [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it 
> later. 
> And we plan test on other stable branches too, and we can integrate them to 
> amplab when they are ready.
> We have offered an arm instance and sent the infos to shane knapp, thanks 
> shane to add the first arm job to amplab jenkins :) 
> The other important thing is about the leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  we can see there is no arm64 supporting. So we build an arm64 supporting 
> release of leveldbjni see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the spark pom.xml directly with something like 
> 'property'/'profile' to choose the correct jar package on arm or x86 platforms, 
> because spark depends on some hadoop packages like hadoop-hdfs, and those packages 
> depend on leveldbjni-all-1.8 too, unless hadoop releases a new arm-supporting 
> leveldbjni jar. Now we download the leveldbjni-all-1.8 of 
> openlabtesting and 'mvn install' it to use when testing spark on arm.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-11-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22340.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24898
[https://github.com/apache/spark/pull/24898]

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Mortenson
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
> sc.setJobGroup('hello', 'hello jobs')
> x = sc.range(100).sum()
> y = sc.range(1000).sum()
> return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
> future = executor.submit(run_jobs)
> sc.cancelJobGroup('hello')
> future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 
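
For illustration, a minimal Scala sketch of the JVM-side behaviour described above, assuming a spark-shell where {{sc}} is the SparkContext: the group applies to jobs started by the JVM thread that called {{setJobGroup}}, which is why Python threads multiplexed onto shared JVM threads cannot rely on it.

{code:scala}
// Hypothetical spark-shell session; `sc` is the usual SparkContext.
val t1 = new Thread(new Runnable {
  override def run(): Unit = {
    sc.setJobGroup("groupA", "jobs submitted from JVM thread t1")
    sc.range(0, 100).count() // runs under job group "groupA"
  }
})
val t2 = new Thread(new Runnable {
  override def run(): Unit = {
    sc.range(0, 100).count() // no group: this JVM thread never called setJobGroup
  }
})
t1.start(); t2.start()
t1.join(); t2.join()
{code}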



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-11-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22340:


Assignee: Hyukjin Kwon

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Mortenson
>Assignee: Hyukjin Kwon
>Priority: Major
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
> sc.setJobGroup('hello', 'hello jobs')
> x = sc.range(100).sum()
> y = sc.range(1000).sum()
> return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
> future = executor.submit(run_jobs)
> sc.cancelJobGroup('hello')
> future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29781) Override SBT Jackson-databind dependency like Maven

2019-11-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29781:
-

Assignee: Dongjoon Hyun

> Override SBT Jackson-databind dependency like Maven
> ---
>
> Key: SPARK-29781
> URL: https://issues.apache.org/jira/browse/SPARK-29781
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This is `branch-2.4` only issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29781) Override SBT Jackson-databind dependency like Maven

2019-11-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29781.
---
Fix Version/s: 2.4.5
   Resolution: Fixed

Issue resolved by pull request 26417
[https://github.com/apache/spark/pull/26417]

> Override SBT Jackson-databind dependency like Maven
> ---
>
> Key: SPARK-29781
> URL: https://issues.apache.org/jira/browse/SPARK-29781
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.5
>
>
> This is `branch-2.4` only issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-11-07 Thread venkata yerubandi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969494#comment-16969494
 ] 

venkata yerubandi commented on SPARK-29335:
---

Srini 

I am also having this issue now. Were you able to fix it?

-thanks

-venkat 
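
One way to check whether the collected statistics are actually visible to the optimizer (a Scala sketch only, assuming a spark-shell session bound to the same catalog, with {{spark}} the usual SparkSession; table names follow the issue below):

{code:scala}
val plan = spark.table("arrow.t_fperiods_sundar").queryExecution.optimizedPlan
println(plan.stats)                 // sizeInBytes, plus rowCount when CBO stats exist
println(plan.stats.attributeStats)  // per-column statistics, if ANALYZE ... FOR COLUMNS ran
{code}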

 

> Cost Based Optimizer stats are not used while evaluating query plans in Spark 
> Sql
> -
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 2.3.0
> Environment: We tried to execute the same using spark-sql and the Thrift 
> server (via SQLWorkbench), but we are not able to use the stats.
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO to get better plans for a few 
> critical queries run through spark-sql or through the thrift server using the 
> jdbc driver. The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
>  
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, 
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
> absref;
> {code}
> Analyze completed successfully.
> Describe extended does not show column-level stats, and queries are not 
> leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
> *When we run queries through spark-sql we cannot see the stats being used in 
> the explain plan, so we are not sure whether CBO is in effect.*
> *A quick response would be helpful.*
> *Explain Plan:*
> The following EXPLAIN output does not reference any statistics.
>  
> {code}
> spark-sql> explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
> 2017),(ptype#4546 = A),(eid#4542 = 
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(absref#4569),(absref#4569 = 
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: 
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
> ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
> cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
> cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- 
> 

[jira] [Resolved] (SPARK-29796) HiveExternalCatalogVersionsSuite` should ignore preview release

2019-11-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29796.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26428
[https://github.com/apache/spark/pull/26428]

> HiveExternalCatalogVersionsSuite` should ignore preview release
> ---
>
> Key: SPARK-29796
> URL: https://issues.apache.org/jira/browse/SPARK-29796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 2.4.5, 3.0.0
>
>
> This issue is to exclude the `preview` release in order to recover 
> `HiveExternalCatalogVersionsSuite`. The new preview release has been breaking 
> the `branch-2.4` PRBuilder since yesterday. A new release (especially a 
> `preview`) should not affect `branch-2.4`.
> - https://github.com/apache/spark/pull/26417 (Failed 4 times)
> {code}
> scala> 
> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
> java.util.NoSuchElementException: None.get
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29796) HiveExternalCatalogVersionsSuite` should ignore preview release

2019-11-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29796:
-

Assignee: Dongjoon Hyun

> HiveExternalCatalogVersionsSuite` should ignore preview release
> ---
>
> Key: SPARK-29796
> URL: https://issues.apache.org/jira/browse/SPARK-29796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
>
> This issue is to exclude the `preview` release in order to recover 
> `HiveExternalCatalogVersionsSuite`. The new preview release has been breaking 
> the `branch-2.4` PRBuilder since yesterday. A new release (especially a 
> `preview`) should not affect `branch-2.4`.
> - https://github.com/apache/spark/pull/26417 (Failed 4 times)
> {code}
> scala> 
> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
> java.util.NoSuchElementException: None.get
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29796) HiveExternalCatalogVersionsSuite` should ignore preview release

2019-11-07 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29796:
-

 Summary: HiveExternalCatalogVersionsSuite` should ignore preview 
release
 Key: SPARK-29796
 URL: https://issues.apache.org/jira/browse/SPARK-29796
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.4.5, 3.0.0
Reporter: Dongjoon Hyun


This issue is to exclude the `preview` release in order to recover 
`HiveExternalCatalogVersionsSuite`. The new preview release has been breaking 
the `branch-2.4` PRBuilder since yesterday. A new release (especially a 
`preview`) should not affect `branch-2.4`.
- https://github.com/apache/spark/pull/26417 (Failed 4 times)

{code}
scala> 
scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
java.util.NoSuchElementException: None.get
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27990) Provide a way to recursively load data from datasource

2019-11-07 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969462#comment-16969462
 ] 

Nicholas Chammas edited comment on SPARK-27990 at 11/7/19 5:54 PM:
---

Are there any docs for this new {{recursiveFileLookup}} option? I can't find 
anything.


was (Author: nchammas):
Are there any docs for this new option? I can't find anything.
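
For reference, a minimal usage sketch of the {{recursiveFileLookup}} option mentioned above (the path is a placeholder; this is an illustration, not official documentation):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("recursive-lookup-sketch").getOrCreate()

// Read every file under the directory tree instead of relying on partition discovery.
val df = spark.read
  .option("recursiveFileLookup", "true")
  .parquet("/data/landing/events") // hypothetical path
{code}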

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource

2019-11-07 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969462#comment-16969462
 ] 

Nicholas Chammas commented on SPARK-27990:
--

Are there any docs for this new option? I can't find anything.

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29795) Possible 'leak' of Metrics with dropwizard metrics 4.x

2019-11-07 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-29795:


 Summary: Possible 'leak' of Metrics with dropwizard metrics 4.x
 Key: SPARK-29795
 URL: https://issues.apache.org/jira/browse/SPARK-29795
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


This one's a little complex to explain.

SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears 
to be fine, according to tests. We have not and do not intend to backport it to 
2.4.x.

However, I'm working with a few people trying to back-port this to Spark 2.4.x 
separately. When this update is applied, tests fail readily with 
OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A heap 
dump analysis shows that MetricRegistry objects are retaining a gigabyte or 
more of memory.

It appears to be holding references to many large internal Spark objects like 
BlockManager and Netty objects, via closures we pass to Gauge objects. Although 
it looked odd, this may or may not be an issue; in normal usage where a JVM 
hosts one SparkContext, this may be normal.

However in tests where contexts are started/restarted repeatedly, it seems like 
this might 'leak' old references to old context-related objects across runs via 
metrics. I don't have a clear theory on how yet (is SparkEnv shared or some ref 
held to it?), besides the empirical evidence. However, it's also not clear why 
this wouldn't affect Spark 3, apparently, as tests work fine. It could be 
another fix in Spark 3 that happens to help here; it could be that Spark 3 uses 
less memory and never hits the issue.

Despite that uncertainty, I've found that simply clearing the registered 
metrics from MetricsSystem when it is stop()-ped seems to resolve the issue. At 
this point, Spark is shutting down and sinks have stopped, so there doesn't 
seem to be any harm in manually releasing all registered metrics and objects. I 
don't _think_ it's intended to track metrics across two instantiations of a 
SparkContext in the same JVM, but that's a question.

That's the change I will propose in a PR.

Why does this not happen in 2.4 + metrics 3.x? Unclear. We've not seen any test 
failures like this in 2.4 or reports of problems with metrics-related memory 
pressure. It could be a change in how 4.x behaves, tracks objects, manages 
lifecycles.

The difference does not seem to be Scala 2.11 vs 2.12, by the way. 2.4 works 
fine on both without the 4.x update; runs out of memory on both with the change.

Why do this if this only affects 2.4 + metrics 4.x and we're not moving to 
metrics 4.x in 2.4? It could still be a smaller issue in Spark 3, not detected 
by tests. It may help apps that do for various reasons run multiple 
SparkContexts per JVM - like many other project test suites. It may just be 
good for tidiness in shutdown, to manually clear resources.

Therefore I can't call this a bug per se, maybe an improvement.
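
A minimal sketch of the cleanup described above, assuming the metrics system holds a dropwizard {{MetricRegistry}}; the class below is illustrative and is not Spark's actual MetricsSystem:

{code:scala}
import com.codahale.metrics.{MetricFilter, MetricRegistry}

// Illustrative stand-in: once sinks are stopped, drop every registered metric so
// gauges closing over SparkEnv/BlockManager objects cannot be retained across
// SparkContext restarts in the same JVM.
class MetricsSystemSketch(registry: MetricRegistry) {
  def stop(): Unit = {
    // ... stop sources and sinks here ...
    registry.removeMatching(MetricFilter.ALL) // release all registered metrics
  }
}
{code}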



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29794) Column level compression

2019-11-07 Thread Anirudh Vyas (Jira)
Anirudh Vyas created SPARK-29794:


 Summary: Column level compression
 Key: SPARK-29794
 URL: https://issues.apache.org/jira/browse/SPARK-29794
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Anirudh Vyas


Currently in Spark we do not have the capability to specify different compression 
codecs for different columns; this capability exists in the Parquet format, for 
example.

I am not sure whether this has been opened before (it may well have been, but I 
cannot find it), hence opening a lane for a potential improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29793) Display plan evolve history in AQE

2019-11-07 Thread Wei Xue (Jira)
Wei Xue created SPARK-29793:
---

 Summary: Display plan evolve history in AQE
 Key: SPARK-29793
 URL: https://issues.apache.org/jira/browse/SPARK-29793
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue


To have a way to present the entire evolution history of an AQE plan. This can be 
done in two stages:
 # Enable an option for printing the plan changes on the go.
 # Display history plans in Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29792:
---

Assignee: Ke Jia

> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Ke Jia
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-07 Thread Wei Xue (Jira)
Wei Xue created SPARK-29792:
---

 Summary: SQL metrics cannot be updated to subqueries in AQE
 Key: SPARK-29792
 URL: https://issues.apache.org/jira/browse/SPARK-29792
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29791) Add a spark config to allow user to use executor cores virtually.

2019-11-07 Thread zengrui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zengrui updated SPARK-29791:

Description: We can config the executor cores by "spark.executor.cores". 
For example, if we config 8 cores for a executor, then the driver can only 
scheduler 8 tasks to this executor concurrently. In fact, most cases a task 
does not always occupy a core or more. More time, tasks spent on disk IO or 
network IO, so we can make driver to scheduler more than 8 tasks(virtual the 
cores to 16,32 or more the executor report to driver) to this executor 
concurrently, it will make the whole job execute more quickly.  (was: We can 
config the executor cores by "spark.executor.cores". For example, if we config 
8 cores for a executor, then the driver can only scheduler 8 tasks to this 
executor concurrently. In fact, most cases a task does not always occupy a core 
or more. More time, tasks spent on disk IO or network IO, so we can make driver 
to scheduler more than 8 tasks to this executor concurrently, it will make the 
whole job execute more quickly.)

> Add a spark config to allow user to use executor cores virtually.
> -
>
> Key: SPARK-29791
> URL: https://issues.apache.org/jira/browse/SPARK-29791
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: zengrui
>Priority: Minor
>
> We can configure the executor cores with "spark.executor.cores". For example, if 
> we configure 8 cores for an executor, the driver can only schedule 8 tasks 
> to that executor concurrently. In practice a task does not always occupy a 
> full core; much of the time tasks are waiting on disk or network IO. So we 
> could let the driver schedule more than 8 tasks to that executor concurrently 
> (by virtualizing the cores the executor reports to the driver to 16, 32, or 
> more), which would make the whole job run faster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29791) Add a spark config to allow user to use executor cores virtually.

2019-11-07 Thread zengrui (Jira)
zengrui created SPARK-29791:
---

 Summary: Add a spark config to allow user to use executor cores 
virtually.
 Key: SPARK-29791
 URL: https://issues.apache.org/jira/browse/SPARK-29791
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: zengrui


We can configure the executor cores with "spark.executor.cores". For example, if we 
configure 8 cores for an executor, the driver can only schedule 8 tasks to 
that executor concurrently. In practice a task does not always occupy a full 
core; much of the time tasks are waiting on disk or network IO. So we could let 
the driver schedule more than 8 tasks to that executor concurrently, which 
would make the whole job run faster.
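
For context, a sketch of the knobs that exist today (the proposed "virtual cores" setting does not exist yet, so none is shown):

{code:scala}
import org.apache.spark.SparkConf

// Today the only levers are the real core count and cores-per-task:
// with these settings an executor runs at most 8 tasks concurrently.
val conf = new SparkConf()
  .setAppName("executor-cores-sketch")
  .set("spark.executor.cores", "8")
  .set("spark.task.cpus", "1")
{code}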



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-07 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969301#comment-16969301
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

[~hyukjin.kwon] Could you please help me with this?

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
> When I add java.time.LocalDateTime/java.time.LocalDate properties to 
> my data model, the serialization to Parquet starts failing.
> *My Data Model:*
> @Data
> public class Employee {
>     private UUID id = UUID.randomUUID();
>     private String name;
>     private int age;
>     private LocalDate dob;
>     private LocalDateTime startDateTime;
>     private String phone;
>     private Address address;
> }
>  
> *Serialization Snippet*
> public void serialize() {
>     List<Employee> inputDataToSerialize = getInputDataToSerialize(); // creates 100,000 employee objects
>     Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
>     Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);
>     employeeDataset.write()
>         .mode(SaveMode.Append)
>         .parquet("/Users/felix/Downloads/spark.parquet");
> }
> *Exception Stack Trace:*
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
>  ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
> task 0.0 in stage 0.0 (TID 0, 

[jira] [Commented] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version

2019-11-07 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969279#comment-16969279
 ] 

Yuming Wang commented on SPARK-29784:
-

!image-2019-11-07-22-06-48-156.png!

> Built in function trim is not compatible in 3.0 with previous version
> -
>
> Key: SPARK-29784
> URL: https://issues.apache.org/jira/browse/SPARK-29784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: image-2019-11-07-22-06-48-156.png
>
>
> SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 
> 2.4 and 2.3.2 return the string with the leading and trailing characters removed.
> Spark 3.0 – Not correct
> jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +---+
> | trim(SL, SSparkSQLS) |
> +---+
> | |
> +---
> Spark 2.4 – Correct
> jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +---+--+
> | trim(SSparkSQLS, SL) |
> +---+--+
> | parkSQ |
> +---+--+
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version

2019-11-07 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29784:

Attachment: image-2019-11-07-22-06-48-156.png

> Built in function trim is not compatible in 3.0 with previous version
> -
>
> Key: SPARK-29784
> URL: https://issues.apache.org/jira/browse/SPARK-29784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: image-2019-11-07-22-06-48-156.png
>
>
> SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 
> 2.4 and 2.3.2 return the string with the leading and trailing characters removed.
> Spark 3.0 – Not correct
> jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +---+
> | trim(SL, SSparkSQLS) |
> +---+
> | |
> +---
> Spark 2.4 – Correct
> jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +---+--+
> | trim(SSparkSQLS, SL) |
> +---+--+
> | parkSQ |
> +---+--+
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29784) Built in function trim is not compatible in 3.0 with previous version

2019-11-07 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969272#comment-16969272
 ] 

Yuming Wang commented on SPARK-29784:
-

Thank you [~pavithraramachandran]. We changed the non-standard argument order of 
the trim function in https://github.com/apache/spark/pull/24902 from 
{{TRIM(trimStr, str)}} to {{TRIM(str, trimStr)}} to be compatible with other 
databases.
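
For illustration, a sketch of the two argument orders side by side (assuming a spark-shell session where {{spark}} is the SparkSession; given the description above, each call should return 'parkSQ' on its respective version):

{code:scala}
// Spark 3.0 argument order per https://github.com/apache/spark/pull/24902: TRIM(str, trimStr)
spark.sql("SELECT trim('SSparkSQLS', 'SL')").show()

// Spark 2.4 and earlier argument order: TRIM(trimStr, str)
spark.sql("SELECT trim('SL', 'SSparkSQLS')").show()
{code}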


> Built in function trim is not compatible in 3.0 with previous version
> -
>
> Key: SPARK-29784
> URL: https://issues.apache.org/jira/browse/SPARK-29784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 
> 2.4 and 2.3.2 return the string with the leading and trailing characters removed.
> Spark 3.0 – Not correct
> jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +---+
> | trim(SL, SSparkSQLS) |
> +---+
> | |
> +---
> Spark 2.4 – Correct
> jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS');
> +---+--+
> | trim(SSparkSQLS, SL) |
> +---+--+
> | parkSQ |
> +---+--+
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29790) Add notes about port being required for Kubernetes API URL when set as master

2019-11-07 Thread Jira
Emil Sandstø created SPARK-29790:


 Summary: Add notes about port being required for Kubernetes API 
URL when set as master
 Key: SPARK-29790
 URL: https://issues.apache.org/jira/browse/SPARK-29790
 Project: Spark
  Issue Type: Documentation
  Components: Kubernetes
Affects Versions: 2.4.4, 2.4.3
Reporter: Emil Sandstø


Apparently, when configuring the master endpoint, the Kubernetes API URL needs 
to include the port, even if it is 443. This should be noted in the 
"Running Spark on Kubernetes" guide.

Reported in the wild 
[https://medium.com/@kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d]

I had the same issue myself.

We might want to create an issue for fixing the implementation, if it's 
considered a bug.
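
A sketch of what the workaround looks like (the API server host below is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

// Note the explicit :443; per the report above, omitting the port can fail
// even though 443 is the default HTTPS port.
val spark = SparkSession.builder()
  .appName("k8s-master-sketch")
  .master("k8s://https://kubernetes.example.com:443")
  .getOrCreate()
{code}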



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25466) Documentation does not specify how to set Kafka consumer cache capacity for SS

2019-11-07 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969262#comment-16969262
 ] 

Gabor Somogyi commented on SPARK-25466:
---

Well, if you think there is a bug in Spark, please collect the driver + executor 
logs, open a new bug jira, and attach them.
Having partial code on Stack Overflow which is halfway DStreams, halfway 
Structured Streaming, and which originally threw an exception by default, 
doesn't help.

> Documentation does not specify how to set Kafka consumer cache capacity for SS
> --
>
> Key: SPARK-25466
> URL: https://issues.apache.org/jira/browse/SPARK-25466
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Patrick McGloin
>Priority: Minor
>
> When hitting this warning with SS:
> 19-09-2018 12:05:27 WARN  CachedKafkaConsumer:66 - KafkaConsumer cache 
> hitting max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30)
> If you Google you get to this page:
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> Which is for Spark Streaming and says to use this config item to adjust the 
> capacity: "spark.streaming.kafka.consumer.cache.maxCapacity".
> This is a bit confusing as SS uses a different config item: 
> "spark.sql.kafkaConsumerCache.capacity"
> Perhaps the SS Kafka documentation should talk about the consumer cache 
> capacity?  Perhaps here?
> https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
> Or perhaps the warning message should reference the config item.  E.g
> 19-09-2018 12:05:27 WARN  CachedKafkaConsumer:66 - KafkaConsumer cache 
> hitting max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30).
>   *The cache size can be adjusted with the setting 
> "spark.sql.kafkaConsumerCache.capacity".*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29789) should not parse the bucket column name again when creating v2 tables

2019-11-07 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29789:
---

 Summary: should not parse the bucket column name again when 
creating v2 tables
 Key: SPARK-29789
 URL: https://issues.apache.org/jira/browse/SPARK-29789
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29788) Fix the typo errors in the SQL reference document merges

2019-11-07 Thread jobit mathew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969226#comment-16969226
 ] 

jobit mathew commented on SPARK-29788:
--

I will work on this.

> Fix the typo errors in the SQL reference document merges
> 
>
> Key: SPARK-29788
> URL: https://issues.apache.org/jira/browse/SPARK-29788
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> Fix the typo errors in the SQL reference document merges.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29788) Fix the typo errors in the SQL reference document merges

2019-11-07 Thread jobit mathew (Jira)
jobit mathew created SPARK-29788:


 Summary: Fix the typo errors in the SQL reference document merges
 Key: SPARK-29788
 URL: https://issues.apache.org/jira/browse/SPARK-29788
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: jobit mathew


Fix the typo errors in the SQL reference document merges.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29757) Move calendar interval constants together

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29757.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26399
[https://github.com/apache/spark/pull/26399]

> Move calendar interval constants together
> -
>
> Key: SPARK-29757
> URL: https://issues.apache.org/jira/browse/SPARK-29757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code:java}
> public static final Byte DAYS_PER_MONTH = 30;
>   public static final Byte MONTHS_PER_QUARTER = 3;
>   public static final int MONTHS_PER_YEAR = 12;
>   public static final int YEARS_PER_MILLENNIUM = 1000;
>   public static final int YEARS_PER_CENTURY = 100;
>   public static final int YEARS_PER_DECADE = 10;
>   public static final long NANOS_PER_MICRO = 1000L;
>   public static final long MILLIS_PER_SECOND = 1000L;
>   public static final long MICROS_PER_MILLI = 1000L;
>   public static final long SECONDS_PER_DAY = 24L * 60 * 60;
>   public static final long MILLIS_PER_MINUTE = 60 * MILLIS_PER_SECOND;
>   public static final long MILLIS_PER_HOUR = 60 * MILLIS_PER_MINUTE;
>   public static final long MILLIS_PER_DAY = SECONDS_PER_DAY * 
> MILLIS_PER_SECOND;
>   public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * 
> MICROS_PER_MILLI;
>   public static final long MICROS_PER_MINUTE = MILLIS_PER_MINUTE * 
> MICROS_PER_MILLI;
>   public static final long MICROS_PER_HOUR = MILLIS_PER_HOUR * 
> MICROS_PER_MILLI;
>   public static final long MICROS_PER_DAY = SECONDS_PER_DAY * 
> MICROS_PER_SECOND;
>   public static final long MICROS_PER_WEEK = MICROS_PER_DAY * 7;
>   public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
>   /* 365.25 days per year assumes leap year every four years */
>   public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100;
>   public static final long NANOS_PER_SECOND = NANOS_PER_MICRO * 
> MICROS_PER_SECOND;
>   public static final long NANOS_PER_MILLIS = NANOS_PER_MICRO * 
> MICROS_PER_MILLI;
> {code}
> The above constants are defined across IntervalUtils, DateTimeUtils, and 
> CalendarInterval; some of them are redundant and some are 
> cross-referenced. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29757) Move calendar interval constants together

2019-11-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29757:
---

Assignee: Kent Yao

> Move calendar interval constants together
> -
>
> Key: SPARK-29757
> URL: https://issues.apache.org/jira/browse/SPARK-29757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> {code:java}
> public static final Byte DAYS_PER_MONTH = 30;
>   public static final Byte MONTHS_PER_QUARTER = 3;
>   public static final int MONTHS_PER_YEAR = 12;
>   public static final int YEARS_PER_MILLENNIUM = 1000;
>   public static final int YEARS_PER_CENTURY = 100;
>   public static final int YEARS_PER_DECADE = 10;
>   public static final long NANOS_PER_MICRO = 1000L;
>   public static final long MILLIS_PER_SECOND = 1000L;
>   public static final long MICROS_PER_MILLI = 1000L;
>   public static final long SECONDS_PER_DAY = 24L * 60 * 60;
>   public static final long MILLIS_PER_MINUTE = 60 * MILLIS_PER_SECOND;
>   public static final long MILLIS_PER_HOUR = 60 * MILLIS_PER_MINUTE;
>   public static final long MILLIS_PER_DAY = SECONDS_PER_DAY * 
> MILLIS_PER_SECOND;
>   public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * 
> MICROS_PER_MILLI;
>   public static final long MICROS_PER_MINUTE = MILLIS_PER_MINUTE * 
> MICROS_PER_MILLI;
>   public static final long MICROS_PER_HOUR = MILLIS_PER_HOUR * 
> MICROS_PER_MILLI;
>   public static final long MICROS_PER_DAY = SECONDS_PER_DAY * 
> MICROS_PER_SECOND;
>   public static final long MICROS_PER_WEEK = MICROS_PER_DAY * 7;
>   public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
>   /* 365.25 days per year assumes leap year every four years */
>   public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100;
>   public static final long NANOS_PER_SECOND = NANOS_PER_MICRO * 
> MICROS_PER_SECOND;
>   public static final long NANOS_PER_MILLIS = NANOS_PER_MICRO * 
> MICROS_PER_MILLI;
> {code}
> The above constants are defined across IntervalUtils, DateTimeUtils, and 
> CalendarInterval; some of them are redundant and some are 
> cross-referenced. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.

2019-11-07 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969167#comment-16969167
 ] 

Ankit Raj Boudh commented on SPARK-29760:
-

@Sean R. Owen, I will raise a PR for this.

> Document VALUES statement in SQL Reference.
> ---
>
> Key: SPARK-29760
> URL: https://issues.apache.org/jira/browse/SPARK-29760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>
> spark-sql also supports *VALUES*.
> {code:java}
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three');
> 1   one
> 2   two
> 3   three
> Time taken: 0.015 seconds, Fetched 3 row(s)
> spark-sql>
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2;
> 1   one
> 2   two
> Time taken: 0.014 seconds, Fetched 2 row(s)
> spark-sql>
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2;
> 1   one
> 3   three
> 2   two
> Time taken: 0.153 seconds, Fetched 3 row(s)
> spark-sql>
> {code}
> *VALUES* can even be used along with INSERT INTO or SELECT.
> refer: https://www.postgresql.org/docs/current/sql-values.html 
> So please confirm whether VALUES should also be documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29787) Move method add/subtract/negate from CalendarInterval to IntervalUtils

2019-11-07 Thread Kent Yao (Jira)
Kent Yao created SPARK-29787:


 Summary: Move method add/subtract/negate from CalendarInterval to 
IntervalUtils
 Key: SPARK-29787
 URL: https://issues.apache.org/jira/browse/SPARK-29787
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


Move these helper methods into IntervalUtils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org