[jira] [Commented] (SPARK-14480) Simplify CSV parsing process with a better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231676#comment-15231676 ]

Reynold Xin commented on SPARK-14480:
-------------------------------------

Please go ahead!

> Simplify CSV parsing process with a better performance
> ------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it
> was made like this for better performance. However, there appear to be two
> problems.
> Firstly, it is actually not faster than processing line by line with an
> {{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
> into a {{Reader}}.
> Secondly, this brought extra complexity, since additional logic is needed to
> allow every line to be read byte by byte. As a result, it was pretty difficult
> to track down parsing issues (e.g. SPARK-14103). Almost all of the code in
> {{CSVParser}} might not be needed.
> I made a rough patch and tested it. The test results for the first problem
> are below:
> h4. Results
> - Original code with a {{Reader}} wrapping an {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New code with an {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail,
> h4. Method
> - The TPC-H lineitem table is tested.
> - Only 100 rows are collected, due to the lack of resources.
> - The end-to-end tests and parsing time tests are each performed 10 times and
> the averages are calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with scale factor 1
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: " + (System.nanoTime - s) / 1e6 + "ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for the original code (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for the new code (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
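As an aside, the difference between the two approaches described above can be sketched in plain Scala. This is a self-contained illustration, not the actual Spark or univocity code: {{splitLine}} is a stand-in for the real CSV parser, and the sample rows stand in for the TPC-H data.

```scala
import java.io.{BufferedReader, Reader, StringReader}

object CsvSketch {
  // Stand-in for the real parser: split a pipe-delimited record.
  def splitLine(line: String): Array[String] = line.split('|')

  // Direct approach: parse each line as the iterator yields it.
  def parseDirect(lines: Iterator[String]): Seq[Array[String]] =
    lines.map(splitLine).toSeq

  // Reader-wrapping approach: glue the lines back into a character
  // stream, then re-discover the line boundaries. The round trip
  // through a Reader adds work without changing the result, which
  // is the point of the issue.
  def parseViaReader(lines: Iterator[String]): Seq[Array[String]] = {
    val reader: Reader = new StringReader(lines.mkString("\n"))
    val buffered = new BufferedReader(reader)
    Iterator.continually(buffered.readLine())
      .takeWhile(_ != null)
      .map(splitLine)
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq("1|foo|2.0", "2|bar|3.5")
    val direct = parseDirect(sample.iterator)
    val viaReader = parseViaReader(sample.iterator)
    // Both paths produce the same records; the direct one simply
    // skips the Iterator-to-Reader conversion layer.
    assert(direct.map(_.toSeq) == viaReader.map(_.toSeq))
    println(direct.map(_.mkString(",")).mkString(";"))
  }
}
```

Both functions return identical records; only the Reader path pays for the extra buffering and line re-splitting.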
[jira] [Commented] (SPARK-14480) Simplify CSV parsing process with a better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231674#comment-15231674 ]

Hyukjin Kwon commented on SPARK-14480:
--------------------------------------

[~rxin] [~srowen] Could I maybe try to open a PR for this first? I think the code would give a clearer view.
[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14480: - Description: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) - Size : 724.66 MB h4. 
Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... // `reader` is a wrapper for an Iterator. private val reader = new StringIteratorReader(iter) parser.beginParsing(reader) ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(iter.next())) ... {code} was: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. 
Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)] - Size : 724.66 MB h4. Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... // `reader` is a wrapper for an Iterator. private val reader = new StringIteratorReader(iter) parser.beginParsing(reader) ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(iter.next())) ... {code} > Simplify CSV parsing process with a better performance > --- > > Key: SPARK-14480 > URL: https://issues.apache.org/jira/browse/SPARK-14480 > Project: Spark > Issue Type: Improvement >
[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14480: - Description: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)] - Size : 724.66 MB h4. 
Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... // `reader` is a wrapper for {{Iterator}} private val reader = new StringIteratorReader(iter) parser.beginParsing(reader) ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(iter.next())) ... {code} was: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. 
Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)] - Size : 724.66 MB h4. Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(filteredIter.next())) ... {code} > Simplify CSV parsing process with a better performance > --- > > Key: SPARK-14480 > URL: https://issues.apache.org/jira/browse/SPARK-14480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, CSV data source
[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14480: - Description: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)] - Size : 724.66 MB h4. 
Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... // `reader` is a wrapper for an Iterator. private val reader = new StringIteratorReader(iter) parser.beginParsing(reader) ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(iter.next())) ... {code} was: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. 
Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)] - Size : 724.66 MB h4. Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... // `reader` is a wrapper for {{Iterator}} private val reader = new StringIteratorReader(iter) parser.beginParsing(reader) ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(iter.next())) ... {code} > Simplify CSV parsing process with a better performance > --- > > Key: SPARK-14480 > URL: https://issues.apache.org/jira/browse/SPARK-14480 > Project: Spark > Issue Type: Improvement >
[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14480: - Description: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. Dataset - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)] - Size : 724.66 MB h4. 
Test Codes - Function to measure time {code} def time[A](f: => A) = { val s = System.nanoTime val ret = f println("time: "+(System.nanoTime-s)/1e6+"ms") ret } {code} - End-to-end test {code} val path = "lineitem.tbl" val df = sqlContext .read .format("csv") .option("header", "false") .option("delimiter", "|") .load(path) time(df.take(100)) {code} - Parsing time test for original (in {{BulkCsvParser}}) {code} ... time(parser.parseNext()) ... {code} - Parsing time test for new (in {{BulkCsvParser}}) {code} ... time(parser.parseLine(filteredIter.next())) ... {code} was: Currently, CSV data source reads and parses CSV data bytes by bytes (not line by line). In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think is made like this for better performance. However, it looks there are two problems. Firstly, it was actually not faster than processing line by line with {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes in {{CSVParser}} might not be needed. I made a rough patch and tested this. The test results for the first problem are below: h4. Results - Original codes with {{Reader}} wrapping {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New codes with {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more details, h4. Method - TCP-H lineitem table is being tested. - The results are collected only by 100 due to the lack of resources. - End-to-end tests and parsing time tests are performed 10 times and averages are calculated for each. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. 
Dataset - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen]) - Size: 724.66 MB h4. Test Codes - Function to measure time {code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
{code} - End-to-end test {code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code} - Parsing-time test for the original code (in {{BulkCsvParser}}) {code}
...
time(parser.parseNext())
...
{code} - Parsing-time test for the new code (in {{BulkCsvParser}}) {code}
...
time(parser.parseLine(filteredIter.next()))
...
{code} > Simplify CSV parsing process with a better performance > --- > > Key: SPARK-14480 > URL: https://issues.apache.org/jira/browse/SPARK-14480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, the CSV data source reads and parses CSV data byte by byte (not line > by line). > In {{CSVParser.scala}}, there is a {{Reader}}
[jira] [Created] (SPARK-14480) Simplify CSV parsing process with a better performance
Hyukjin Kwon created SPARK-14480: Summary: Simplify CSV parsing process with a better performance Key: SPARK-14480 URL: https://issues.apache.org/jira/browse/SPARK-14480 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon Currently, the CSV data source reads and parses CSV data byte by byte (not line by line). In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made like this for better performance. However, it looks like there are two problems. Firstly, it was actually not faster than processing line by line with an {{Iterator}}, due to the additional logic to wrap the {{Iterator}} in a {{Reader}}. Secondly, this brought a bit of complexity, because it needs additional logic to allow every line to be read byte by byte. So, it was pretty difficult to figure out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in {{CSVParser}} might not be needed. I made a rough patch and tested it. The test results for the first problem are below: h4. Results - Original code with a {{Reader}} wrapping an {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 14116265034 | 2008277960 | - New code with an {{Iterator}} ||End-to-end (ns)||Parse Time (ns)|| | 13451699644 | 1549050564 | In more detail: h4. Method - The TPC-H lineitem table was tested. - Only 100 rows were collected ({{df.take(100)}}), due to limited resources. - End-to-end and parsing-time tests were each performed 10 times, and the averages were calculated. h4. Environment - Machine: MacBook Pro Retina - CPU: 4 - Memory: 8GB h4. Dataset - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen]) - Size: 724.66 MB h4. 
Test Codes - Function to measure time {code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
{code} - End-to-end test {code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code} - Parsing-time test for the original code (in {{BulkCsvParser}}) {code}
...
time(parser.parseNext())
...
{code} - Parsing-time test for the new code (in {{BulkCsvParser}}) {code}
...
time(parser.parseLine(filteredIter.next()))
...
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
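To make the proposed change concrete, here is a minimal, self-contained Scala sketch of the {{Iterator}}-based approach: each line is handed to a per-line parse function instead of wrapping the whole {{Iterator}} in a {{Reader}}. The delimiter split is only a stand-in for the real univocity {{parseLine}} call, and the names here are illustrative, not the actual patch.

```scala
// Sketch of the Iterator-based approach: parse each line directly.
// String.split with a Char splits on the literal character, so '|' is safe here.
def parseLine(line: String, delimiter: Char): Array[String] =
  line.split(delimiter)

// Stands in for the HadoopRDD line iterator in the real data source.
val lines = Iterator("1|foo|2.0", "2|bar|3.5")
val rows = lines.map(parseLine(_, '|')).toList
```

Compared with the {{Reader}}-wrapping version, there is no intermediate byte-level buffering layer, which is what removed most of the extra parse time in the benchmark above.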
[jira] [Assigned] (SPARK-14375) Unit test for spark.ml KMeansSummary
[ https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14375: Assignee: Apache Spark > Unit test for spark.ml KMeansSummary > > > Key: SPARK-14375 > URL: https://issues.apache.org/jira/browse/SPARK-14375 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > There is no unit test for KMeansSummary in spark.ml. > Other items which could be fixed here: > * Add Since version to KMeansSummary class > * Modify clusterSizes method to match GMM method, to be robust to empty > clusters (in case we support that sometime) (See PR for [SPARK-13538]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14375) Unit test for spark.ml KMeansSummary
[ https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231664#comment-15231664 ] Apache Spark commented on SPARK-14375: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/12254 > Unit test for spark.ml KMeansSummary > > > Key: SPARK-14375 > URL: https://issues.apache.org/jira/browse/SPARK-14375 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > There is no unit test for KMeansSummary in spark.ml. > Other items which could be fixed here: > * Add Since version to KMeansSummary class > * Modify clusterSizes method to match GMM method, to be robust to empty > clusters (in case we support that sometime) (See PR for [SPARK-13538]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14375) Unit test for spark.ml KMeansSummary
[ https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14375: Assignee: (was: Apache Spark) > Unit test for spark.ml KMeansSummary > > > Key: SPARK-14375 > URL: https://issues.apache.org/jira/browse/SPARK-14375 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > There is no unit test for KMeansSummary in spark.ml. > Other items which could be fixed here: > * Add Since version to KMeansSummary class > * Modify clusterSizes method to match GMM method, to be robust to empty > clusters (in case we support that sometime) (See PR for [SPARK-13538]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231655#comment-15231655 ] Yan commented on SPARK-14389: - Actually, the current master branch does not have the issue, while 1.6.0 does. There appear to have been improvements to the BNL join since 1.6, SPARK-13213 in particular. > OOM during BroadcastNestedLoopJoin > -- > > Key: SPARK-14389 > URL: https://issues.apache.org/jira/browse/SPARK-14389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OS: Amazon Linux AMI 2015.09 > EMR: 4.3.0 > Hadoop: Amazon 2.7.1 > Spark 1.6.0 > Ganglia 3.7.2 > Master: m3.xlarge > Core: m3.xlarge > m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD >Reporter: Steve Johnston > Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, > sample_script.py, stdout.txt > > > When executing attached sample_script.py in client mode with a single > executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", > during the self join of a small table, TPC-H lineitem generated for a 1M > dataset. Also see execution log stdout.txt attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14461) GLM training summaries should provide solver
[ https://issues.apache.org/jira/browse/SPARK-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14461: Assignee: Apache Spark > GLM training summaries should provide solver > > > Key: SPARK-14461 > URL: https://issues.apache.org/jira/browse/SPARK-14461 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > GLM training summaries have different types of metrics available depending on > the solver used during training. In the summaries, we should provide the > solver used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14461) GLM training summaries should provide solver
[ https://issues.apache.org/jira/browse/SPARK-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14461: Assignee: (was: Apache Spark) > GLM training summaries should provide solver > > > Key: SPARK-14461 > URL: https://issues.apache.org/jira/browse/SPARK-14461 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > GLM training summaries have different types of metrics available depending on > the solver used during training. In the summaries, we should provide the > solver used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14461) GLM training summaries should provide solver
[ https://issues.apache.org/jira/browse/SPARK-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231618#comment-15231618 ] Apache Spark commented on SPARK-14461: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/12253 > GLM training summaries should provide solver > > > Key: SPARK-14461 > URL: https://issues.apache.org/jira/browse/SPARK-14461 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > GLM training summaries have different types of metrics available depending on > the solver used during training. In the summaries, we should provide the > solver used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231616#comment-15231616 ] Xiao Li commented on SPARK-14127: - {noformat} # Partition Information # col_name data_type comment {noformat} These will be two rows; there will not be empty rows. > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?
[ https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14479: Component/s: SparkR > GLM predict type should be link or response? > > > Key: SPARK-14479 > URL: https://issues.apache.org/jira/browse/SPARK-14479 > Project: Spark > Issue Type: Question > Components: ML, SparkR >Reporter: Yanbo Liang > > In R glm and glmnet, the default type of predict is "link", which is the > linear predictor; users can specify "type = response" to output the response > prediction. Currently, the ML glm predict outputs the "response" prediction by > default, which I think is more reasonable. Should we change the default type of > the ML glm predict output? > R glm: > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html > R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet > Meanwhile, we should decide the default type of the glm predict output in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?
[ https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14479: Description: In R glm and glmnet, the default type of predict is "link", which is the linear predictor; users can specify "type = response" to output the response prediction. Currently, the ML glm predict outputs the "response" prediction by default, which I think is more reasonable. Should we change the default type of the ML glm predict output? R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet Meanwhile, we should decide the default type of the glm predict output in SparkR. was: In R glm and glmnet, the default type of predict is "link", which is the linear predictor; users can specify "type = response" to output the response prediction. Currently, the ML glm predict outputs the "response" prediction by default, which I think is more reasonable. Should we change the default type of the ML glm predict output? R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet > GLM predict type should be link or response? > > > Key: SPARK-14479 > URL: https://issues.apache.org/jira/browse/SPARK-14479 > Project: Spark > Issue Type: Question > Components: ML >Reporter: Yanbo Liang > > In R glm and glmnet, the default type of predict is "link", which is the > linear predictor; users can specify "type = response" to output the response > prediction. Currently, the ML glm predict outputs the "response" prediction by > default, which I think is more reasonable. Should we change the default type of > the ML glm predict output? > R glm: > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html > R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet > Meanwhile, we should decide the default type of the glm predict output in SparkR. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14479) GLM predict type should be link or response?
[ https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231604#comment-15231604 ] Yanbo Liang commented on SPARK-14479: - This will introduce a breaking change, so it's better to make a decision before Spark 2.0. cc [~mengxr] [~josephkb] > GLM predict type should be link or response? > > > Key: SPARK-14479 > URL: https://issues.apache.org/jira/browse/SPARK-14479 > Project: Spark > Issue Type: Question > Components: ML >Reporter: Yanbo Liang > > In R glm and glmnet, the default type of predict is "link", which is the > linear predictor; users can specify "type = response" to output the response > prediction. Currently, the ML glm predict outputs the "response" prediction by > default, which I think is more reasonable. Should we change the default type of > the ML glm predict output? > R glm: > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html > R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?
[ https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14479: Description: In R glm and glmnet, the default type of predict is "link", which is the linear predictor; users can specify "type = response" to output the response prediction. Currently, the ML glm predict outputs the "response" prediction by default, which I think is more reasonable. Should we change the default type of the ML glm predict output? R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet was: In R glm and glmnet, the default type of predict is "link", which is the linear predictor; users can specify "type = response" to output the response prediction. Currently, the ML glm predict outputs the "response" prediction by default, which I think is more reasonable. Should we change the default type of the ML glm predict output? > GLM predict type should be link or response? > > > Key: SPARK-14479 > URL: https://issues.apache.org/jira/browse/SPARK-14479 > Project: Spark > Issue Type: Question > Components: ML >Reporter: Yanbo Liang > > In R glm and glmnet, the default type of predict is "link", which is the > linear predictor; users can specify "type = response" to output the response > prediction. Currently, the ML glm predict outputs the "response" prediction by > default, which I think is more reasonable. Should we change the default type of > the ML glm predict output? > R glm: > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html > R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14479) GLM predict type should be link or response?
Yanbo Liang created SPARK-14479: --- Summary: GLM predict type should be link or response? Key: SPARK-14479 URL: https://issues.apache.org/jira/browse/SPARK-14479 Project: Spark Issue Type: Question Components: ML Reporter: Yanbo Liang In R glm and glmnet, the default type of predict is "link", which is the linear predictor; users can specify "type = response" to output the response prediction. Currently, the ML glm predict outputs the "response" prediction by default, which I think is more reasonable. Should we change the default type of the ML glm predict output? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
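For readers unfamiliar with the two predict types being discussed: for a logistic GLM, "link" is the linear predictor (eta = x'beta) and "response" is the inverse link applied to it. A tiny illustrative Scala sketch of that relation (not the Spark API, just the math):

```scala
// For a logistic GLM, "response" = inverse link of "link":
// response = 1 / (1 + exp(-eta)), i.e. a probability in (0, 1).
def linkToResponse(eta: Double): Double = 1.0 / (1.0 + math.exp(-eta))

// A link value of 0.0 corresponds to a predicted probability of 0.5.
val p = linkToResponse(0.0)
```

So the question is only about which of these two equivalent quantities {{predict}} should return by default; each is recoverable from the other via the (inverse) link function.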
[jira] [Commented] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231600#comment-15231600 ] Xiao Li commented on SPARK-14127: - {noformat} hive> create table ptestfilter (a string, b int) partitioned by (c string, d string); OK Time taken: 1.464 seconds hive> > describe ptestfilter; OK a string b int c string d string # Partition Information # col_name data_type comment c string d string Time taken: 0.449 seconds, Fetched: 10 row(s) {noformat} > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231585#comment-15231585 ] Davies Liu commented on SPARK-8632: --- [~bijay697] Python UDFs have been improved a lot recently in master; see https://issues.apache.org/jira/browse/SPARK-14267 and https://issues.apache.org/jira/browse/SPARK-14215. Could you try master? > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and I wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231579#comment-15231579 ] Yanbo Liang commented on SPARK-14378: - I can work on it. > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names
[ https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231570#comment-15231570 ] Apache Spark commented on SPARK-14460: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/12252 > DataFrameWriter JDBC doesn't Quote/Escape column names > -- > > Key: SPARK-14460 > URL: https://issues.apache.org/jira/browse/SPARK-14460 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Sean Rose > Labels: easyfix > > When I try to write a DataFrame which contains a column with a space in it > ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect > syntax near 'Address' > I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping > column names. JdbcDialect has the "quoteIdentifier" method, which could be > called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names
[ https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14460: Assignee: Apache Spark > DataFrameWriter JDBC doesn't Quote/Escape column names > -- > > Key: SPARK-14460 > URL: https://issues.apache.org/jira/browse/SPARK-14460 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Sean Rose >Assignee: Apache Spark > Labels: easyfix > > When I try to write a DataFrame which contains a column with a space in it > ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect > syntax near 'Address' > I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping > column names. JdbcDialect has the "quoteIdentifier" method, which could be > called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names
[ https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14460: Assignee: (was: Apache Spark) > DataFrameWriter JDBC doesn't Quote/Escape column names > -- > > Key: SPARK-14460 > URL: https://issues.apache.org/jira/browse/SPARK-14460 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Sean Rose > Labels: easyfix > > When I try to write a DataFrame which contains a column with a space in it > ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect > syntax near 'Address' > I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping > column names. JdbcDialect has the "quoteIdentifier" method, which could be > called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names
[ https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231077#comment-15231077 ] Bo Meng edited comment on SPARK-14460 at 4/8/16 3:04 AM: - Thanks [~srose03] for finding the root cause; that makes the fix easier. I will post the fix shortly. was (Author: bomeng): I can take a look. Thanks. > DataFrameWriter JDBC doesn't Quote/Escape column names > -- > > Key: SPARK-14460 > URL: https://issues.apache.org/jira/browse/SPARK-14460 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Sean Rose > Labels: easyfix > > When I try to write a DataFrame which contains a column with a space in it > ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect > syntax near 'Address' > I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping > column names. JdbcDialect has the "quoteIdentifier" method, which could be > called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
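A minimal sketch of the fix suggested in the report: quote each column name before building the INSERT statement. The ANSI double-quote form below is an assumption for illustration; the real {{quoteIdentifier}} lives on {{JdbcDialect}} and varies per dialect, and {{insertStatement}} here is a simplified stand-in for the {{JdbcUtils}} method.

```scala
// Quote each column name so that names containing spaces
// (e.g. "Patient Address") survive in the generated SQL.
def quoteIdentifier(colName: String): String = "\"" + colName + "\""

def insertStatement(table: String, columns: Seq[String]): String = {
  val cols = columns.map(quoteIdentifier).mkString(", ")
  val placeholders = columns.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($cols) VALUES ($placeholders)"
}

val sql = insertStatement("patients", Seq("id", "Patient Address"))
// e.g. INSERT INTO patients ("id", "Patient Address") VALUES (?, ?)
```

Delegating the quoting to the dialect (rather than hard-coding double quotes) matters because, for example, MySQL uses backticks rather than ANSI double quotes by default.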
[jira] [Closed] (SPARK-14403) the DAG of a stage may have too many identical child clusters, resulting in GC
[ https://issues.apache.org/jira/browse/SPARK-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula closed SPARK-14403. Resolution: Resolved > the DAG of a stage may have too many identical child clusters, resulting in GC > --- > > Key: SPARK-14403 > URL: https://issues.apache.org/jira/browse/SPARK-14403 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: meiyoula > > When I run a SQL query, I can't open the stage page in the web UI, and as a result > the history server process shuts down. > After debugging the code, I found that a stage graph has more than 5000 identical > child clusters, so when generating the dot file, the process goes down. > I think a graph cluster shouldn't have identical child clusters, right? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-13048. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12166 [https://github.com/apache/spark/pull/12166] > EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel > -- > > Key: SPARK-13048 > URL: https://issues.apache.org/jira/browse/SPARK-13048 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 > Environment: Standalone Spark cluster >Reporter: Jeff Stein >Assignee: Joseph K. Bradley > Fix For: 2.0.0 > > > In EMLDAOptimizer, all checkpoints are deleted before returning the > DistributedLDAModel. > The most recent checkpoint is still necessary for operations on the > DistributedLDAModel under a couple scenarios: > - The graph doesn't fit in memory on the worker nodes (e.g. very large data > set). > - Late worker failures that require reading the now-dependent checkpoint. > I ran into this problem running a 10M record LDA model in a memory starved > environment. The model consistently failed in either the {{collect at > LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the > {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the > model). In both cases, a FileNotFoundException is thrown attempting to access > a checkpoint file. > I'm not sure what the correct fix is here; it might involve a class signature > change. An alternative simple fix is to leave the last checkpoint around and > expect the user to clean the checkpoint directory themselves. > {noformat} > java.io.FileNotFoundException: File does not exist: > /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071 > {noformat} > Relevant code is included below. 
> LDAOptimizer.scala: > {noformat} > override private[clustering] def getLDAModel(iterationTimes: > Array[Double]): LDAModel = { > require(graph != null, "graph is null, EMLDAOptimizer not initialized.") > this.graphCheckpointer.deleteAllCheckpoints() > // The constructor's default arguments assume gammaShape = 100 to ensure > equivalence in > // LDAModel.toLocal conversion > new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, > this.vocabSize, > Vectors.dense(Array.fill(this.k)(this.docConcentration)), > this.topicConcentration, > iterationTimes) > } > {noformat} > PeriodicCheckpointer.scala > {noformat} > /** >* Call this at the end to delete any remaining checkpoint files. >*/ > def deleteAllCheckpoints(): Unit = { > while (checkpointQueue.nonEmpty) { > removeCheckpointFile() > } > } > /** >* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files. >* This prints a warning but does not fail if the files cannot be removed. >*/ > private def removeCheckpointFile(): Unit = { > val old = checkpointQueue.dequeue() > // Since the old checkpoint is not deleted by Spark, we manually delete > it. > val fs = FileSystem.get(sc.hadoopConfiguration) > getCheckpointFiles(old).foreach { checkpointFile => > try { > fs.delete(new Path(checkpointFile), true) > } catch { > case e: Exception => > logWarning("PeriodicCheckpointer could not remove old checkpoint > file: " + > checkpointFile) > } > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
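The "leave the last checkpoint around" alternative mentioned in the report can be sketched with a plain queue: drain everything except the most recent entry, so the returned {{DistributedLDAModel}} can still read the files it depends on. Names here are illustrative, not the actual {{PeriodicCheckpointer}} API.

```scala
import scala.collection.mutable

// Checkpoint directories in creation order; the last one is the checkpoint
// the returned model may still need to read.
val checkpointQueue = mutable.Queue("rdd-100", "rdd-300", "rdd-635")
val deleted = mutable.Buffer[String]()

// Drain all but the most recent checkpoint (deleteAllCheckpoints today
// drains the whole queue, which is what breaks the model).
while (checkpointQueue.size > 1) {
  deleted += checkpointQueue.dequeue() // in the real code: fs.delete(...)
}
// checkpointQueue now holds only the last checkpoint, e.g. "rdd-635".
```

The user would then be expected to clean that one remaining directory themselves, which is the trade-off the reporter notes.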
[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13448: -- Description: This JIRA keeps a list of MLlib behavior changes in Spark 2.0 so we can remember to add them to the migration guide / release notes. * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6. * SPARK-7780: Intercept will not be regularized if users train binary classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because it calls ML LogisticRegression implementation. Meanwhile, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code path); this behavior is different from the old API. * SPARK-12363: Bug fix for PowerIterationClustering which will likely change results * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by default, if checkpointing is being used. was: This JIRA keeps a list of MLlib behavior changes in Spark 2.0 so we can remember to add them to the migration guide / release notes. * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6. * SPARK-7780: Intercept will not be regularized if users train binary classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because it calls ML LogisticRegression implementation. Meanwhile, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code path); this behavior is different from the old API. 
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change results > Document MLlib behavior changes in Spark 2.0 > > > Key: SPARK-13448 > URL: https://issues.apache.org/jira/browse/SPARK-13448 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This JIRA keeps a list of MLlib behavior changes in Spark 2.0 so we can > remember to add them to the migration guide / release notes. > * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 > to 1e-6. > * SPARK-7780: Intercept will not be regularized if users train binary > classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, > because it calls ML LogisticRegression implementation. Meanwhile, if users train > without regularization, training with or without feature scaling will return > the same solution at the same convergence rate (because they run the same code > path); this behavior is different from the old API. > * SPARK-12363: Bug fix for PowerIterationClustering which will likely change > results > * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by > default, if checkpointing is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14478) Should StandardScaler use biased variance to scale?
Joseph K. Bradley created SPARK-14478: - Summary: Should StandardScaler use biased variance to scale? Key: SPARK-14478 URL: https://issues.apache.org/jira/browse/SPARK-14478 Project: Spark Issue Type: Question Components: ML, MLlib Reporter: Joseph K. Bradley Currently, MLlib's StandardScaler scales columns using the unbiased standard deviation. This matches what R's scale package does. However, it is a bit odd for 2 reasons: * Optimization/ML algorithms which require scaled columns generally assume unit variance (for mathematical convenience). That requires using biased variance. * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance. *Question*: Should we switch to biased variance? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?
[ https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231532#comment-15231532 ] Joseph K. Bradley commented on SPARK-14478: --- I'm listing this as "Major" priority since it is a behavioral change and would be good to decide before 2.0. > Should StandardScaler use biased variance to scale? > --- > > Key: SPARK-14478 > URL: https://issues.apache.org/jira/browse/SPARK-14478 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Reporter: Joseph K. Bradley > > Currently, MLlib's StandardScaler scales columns using the unbiased standard > deviation. This matches what R's scale package does. > However, it is a bit odd for 2 reasons: > * Optimization/ML algorithms which require scaled columns generally assume > unit variance (for mathematical convenience). That requires using biased > variance. > * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance. > *Question*: Should we switch to biased variance? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
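The biased/unbiased distinction under discussion can be shown with Python's standard library (plain Python, not MLlib's implementation):

```python
from statistics import stdev, pstdev, pvariance

data = [1.0, 2.0, 3.0, 4.0]

# Unbiased (sample) standard deviation divides squared deviations by n - 1;
# this is what StandardScaler currently uses.
unbiased = stdev(data)
# Biased (population) standard deviation divides by n.
biased = pstdev(data)

# Only scaling by the biased standard deviation yields exactly unit variance,
# which is what many optimization routines assume.
scaled = [x / biased for x in data]
```

For any finite sample the unbiased estimate is strictly larger, so columns scaled by it end up with variance slightly below 1.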
[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231491#comment-15231491 ] Shea Parkes commented on SPARK-13842: - Pull request is available (https://github.com/apache/spark/pull/12251). I did go ahead and make the {{names}} and {{_needSerializeAnyField}} attributes lazy while I was at it. I'll try to ping you guys appropriately on there. > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13842: Assignee: Apache Spark > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Assignee: Apache Spark >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14472) Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable
[ https://issues.apache.org/jira/browse/SPARK-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14472: -- Assignee: Bryan Cutler > Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from > JavaCallable > -- > > Key: SPARK-14472 > URL: https://issues.apache.org/jira/browse/SPARK-14472 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > Currently, JavaCallable is used to wrap a plain Java object and act as a > mixin to JavaModel to provide a convenient method to make Java calls to an > object defined in JavaWrapper. The inheritance structure could be simplified > by defining the object in JavaCallable and use as a base class for > JavaWrapper. Also, some renaming of these classes might better reflect their > purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231490#comment-15231490 ] Apache Spark commented on SPARK-13842: -- User 'skparkes' has created a pull request for this issue: https://github.com/apache/spark/pull/12251 > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13842: Assignee: (was: Apache Spark) > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
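The proposal above follows Python's standard container protocol. A self-contained toy version (hypothetical FieldContainer, standing in for StructType; not the actual pyspark.sql.types class) behaves like this:

```python
class FieldContainer:
    """Toy stand-in for pyspark.sql.types.StructType (hypothetical class)
    illustrating the proposed __iter__ and __getitem__ additions."""

    def __init__(self, names, fields):
        self.names = list(names)
        self.fields = list(fields)

    def __iter__(self):
        """Iterate the fields upon request."""
        return iter(self.fields)

    def __getitem__(self, key):
        """Return the field registered under the given name."""
        _fields_dict = dict(zip(self.names, self.fields))
        try:
            return _fields_dict[key]
        except KeyError:
            raise KeyError('No field named {}'.format(key))

struct = FieldContainer(['id', 'name'], ['LongType', 'StringType'])
```

Defining __iter__ also makes `for field in struct` and `list(struct)` work for free, which is the convenience the issue asks for.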
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231465#comment-15231465 ] Reynold Xin commented on SPARK-10063: - I think Josh et al already replied -- but to close the loop, the direct committer is not safe when there is a network partition, e.g. Spark driver might not be aware of a task that's running on the executor. > Remove DirectParquetOutputCommitter > --- > > Key: SPARK-10063 > URL: https://issues.apache.org/jira/browse/SPARK-10063 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Reynold Xin >Priority: Critical > Fix For: 2.0.0 > > > When we use DirectParquetOutputCommitter on S3 and speculation is enabled, > there is a chance that we can lose data. > Here is the code to reproduce the problem. > {code} > import org.apache.spark.sql.functions._ > val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: > Int, partitionId: Int, attemptNumber: Int) => { > if (partitionId == 0 && i == 5) { > if (attemptNumber > 0) { > Thread.sleep(15000) > throw new Exception("new exception") > } else { > Thread.sleep(1) > } > } > > i > }) > val df = sc.parallelize((1 to 100), 20).mapPartitions { iter => > val context = org.apache.spark.TaskContext.get() > val partitionId = context.partitionId > val attemptNumber = context.attemptNumber > iter.map(i => (i, partitionId, attemptNumber)) > }.toDF("i", "partitionId", "attemptNumber") > df > .select(failSpeculativeTask($"i", $"partitionId", > $"attemptNumber").as("i"), $"partitionId", $"attemptNumber") > .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter") > sqlContext.read.load("/home/yin/outputCommitter").count > // The result is 99 and 5 is missing from the output. > {code} > What happened is that the original task finishes first and uploads its output > file to S3, then the speculative task somehow fails. 
Because we have to call > output stream's close method, which uploads data to S3, we actually upload > the partial result generated by the failed speculative task to S3, and this > file overwrites the correct file generated by the original task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
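The failure mode described above is last-writer-wins: with a direct committer both attempts write straight to the final path, so a failed speculative attempt's partial file can replace the good one. A minimal sketch of the race (plain Python, not the actual committer code):

```python
# Final output location shared by all attempts of the same task.
final_output = {}

def direct_write(attempt_id, rows):
    # No temporary directory and no commit-time rename: close() publishes
    # directly to the final path, whatever state the attempt is in.
    final_output['part-00000'] = (attempt_id, rows)

direct_write('attempt_0', [1, 2, 3, 4, 5])  # original task: complete output
direct_write('attempt_1', [1, 2])           # failed speculative task: partial output wins
```

A conventional committer avoids this by writing each attempt to its own temporary directory and letting a coordinator decide which single attempt gets renamed into place.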
[jira] [Resolved] (SPARK-14452) Explicit APIs in Scala for specifying encoders
[ https://issues.apache.org/jira/browse/SPARK-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14452. - Resolution: Fixed Fix Version/s: 2.0.0 > Explicit APIs in Scala for specifying encoders > -- > > Key: SPARK-14452 > URL: https://issues.apache.org/jira/browse/SPARK-14452 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > The Scala Dataset public API currently only allows users to specify encoders > through SQLContext.implicits. This is OK but sometimes people want to > explicitly get encoders without a SQLContext (e.g. Aggregator > implementations). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14449) SparkContext should use SparkListenerInterface
[ https://issues.apache.org/jira/browse/SPARK-14449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14449. - Resolution: Fixed Fix Version/s: 2.0.0 > SparkContext should use SparkListenerInterface > -- > > Key: SPARK-14449 > URL: https://issues.apache.org/jira/browse/SPARK-14449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups
[ https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231442#comment-15231442 ] Kevin Hogeland edited comment on SPARK-14437 at 4/8/16 12:56 AM: - [~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is able to connect to the block manager. Thanks for the quick patch. I also encountered this error when trying to run with this change on the latest 2.0.0-SNAPSHOT, possibly unrelated but worth documenting here: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 (TID 24, ip-172-16-15-0.us-west-2.compute.internal): java.lang.RuntimeException: Stream '/jars/' was not found. at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) {code} was (Author: hogeland): [~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is able to connect to the block manager. Thanks for the quick patch. I also encountered this error when trying to run on the latest 2.0.0-SNAPSHOT, possibly unrelated but worth documenting here: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 (TID 24, ip-172-16-15-0.us-west-2.compute.internal): java.lang.RuntimeException: Stream '/jars/' was not found. 
at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at
[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups
[ https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231442#comment-15231442 ] Kevin Hogeland commented on SPARK-14437: [~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is able to connect to the block manager. Thanks for the quick patch. I also encountered this error when trying to run on the latest 2.0.0-SNAPSHOT, possibly unrelated but worth documenting here: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 (TID 24, ip-172-16-15-0.us-west-2.compute.internal): java.lang.RuntimeException: Stream '/jars/' was not found. at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) {code} > Spark using Netty RPC gets wrong address in some setups > --- > > Key: SPARK-14437 > URL: https://issues.apache.org/jira/browse/SPARK-14437 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: AWS, Docker, Flannel >Reporter: Kevin Hogeland > > Netty can't get the correct origin address in certain network setups. Spark > should handle this, as relying on Netty correctly reporting all addresses > leads to incompatible and unpredictable network states. We're currently using > Docker with Flannel on AWS. Container communication looks something like: > {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) > -> Container 2 (4.5.6.1)}} > If the client in that setup is Container 1 (1.2.3.1), Netty channels from > there to Container 2 will have a client address of 1.2.3.0. > The {{RequestMessage}} object that is sent over the wire already contains a > {{senderAddress}} field that the sender can use to specify their address. 
In > {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client > socket address when null. {{senderAddress}} in the messages sent from the > executors is currently always null, meaning all messages will have these > incorrect addresses (we've switched back to Akka as a temporary workaround > for this). The executor should send its address explicitly so that the driver > doesn't attempt to infer addresses based on possibly incorrect information > from Netty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
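The inference the report criticizes boils down to: fall back to the channel's remote socket address only when the message carries no explicit sender. A sketch of that fallback (hypothetical function name; the real logic lives in NettyRpcEnv#internalReceive in Scala):

```python
def resolve_sender_address(sender_address, channel_remote_address):
    """Prefer the address the sender declared in the message; fall back to
    the socket address the channel reports, which NAT/overlay networks
    (e.g. Docker with Flannel) can make wrong."""
    if sender_address is not None:
        return sender_address
    return channel_remote_address

# Executor omitted its address: the driver sees the (possibly NATed) host.
inferred = resolve_sender_address(None, '1.2.3.0')
# Executor sent its address explicitly: NAT no longer matters.
declared = resolve_sender_address('1.2.3.1', '1.2.3.0')
```

This is why the report suggests the executor always populate senderAddress rather than leaving the driver to infer it.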
[jira] [Resolved] (SPARK-14468) Always enable OutputCommitCoordinator
[ https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14468. --- Resolution: Fixed Fix Version/s: 1.5.2 2.0.0 1.6.2 1.4.2 Target Version/s: 1.5.2, 1.4.2, 1.6.2, 2.0.0 (was: 1.4.2, 1.5.2, 1.6.2, 2.0.0) > Always enable OutputCommitCoordinator > - > > Key: SPARK-14468 > URL: https://issues.apache.org/jira/browse/SPARK-14468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.4.2, 1.6.2, 2.0.0, 1.5.2 > > > The OutputCommitCoordinator was originally introduced in SPARK-4879 because > speculation causes the output of some partitions to be deleted. However, as > we can see in SPARK-10063, speculation is not the only case where this can > happen. > More specifically, when we retry a stage we're not guaranteed to kill the > tasks that are still running (we don't even interrupt their threads), so we > may end up with multiple concurrent task attempts for the same task. This > leads to problems like SPARK-8029, but this fix alone is necessary but not > sufficient. > In general, when we run into situations like these, we need the > OutputCommitCoordinator because we don't control what the underlying file > system does. Enabling this doesn't induce heavy performance costs so there's > little reason why we shouldn't always enable it to ensure correctness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
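The coordinator's core guarantee — at most one attempt per partition is authorized to commit — can be sketched as a first-requester-wins arbiter. This is assumed semantics for illustration; Spark's real OutputCommitCoordinator also tracks stages and revokes authorization when an attempt fails.

```python
import threading

class CommitArbiter:
    """Hypothetical first-requester-wins commit authorization, the essence
    of what an output commit coordinator provides."""

    def __init__(self):
        self._lock = threading.Lock()
        self._winner = {}  # partition id -> authorized attempt number

    def can_commit(self, partition, attempt):
        with self._lock:
            # setdefault records the first attempt to ask; later attempts
            # for the same partition are denied.
            return self._winner.setdefault(partition, attempt) == attempt

arbiter = CommitArbiter()
```

With concurrent attempts of the same task (speculation or stage retry), only one of them is allowed to publish its output, regardless of what the underlying file system does.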
[jira] [Commented] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231425#comment-15231425 ] Apache Spark commented on SPARK-14477: -- User 'markgrover' has created a pull request for this issue: https://github.com/apache/spark/pull/12250 > Allow custom mirrors for downloading artifacts in build/mvn > --- > > Key: SPARK-14477 > URL: https://issues.apache.org/jira/browse/SPARK-14477 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala > from. It makes sense to override these locations with mirrors in many cases, > so this change will add support for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14477: Assignee: (was: Apache Spark) > Allow custom mirrors for downloading artifacts in build/mvn > --- > > Key: SPARK-14477 > URL: https://issues.apache.org/jira/browse/SPARK-14477 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala > from. It makes sense to override these locations with mirrors in many cases, > so this change will add support for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14477: Assignee: Apache Spark > Allow custom mirrors for downloading artifacts in build/mvn > --- > > Key: SPARK-14477 > URL: https://issues.apache.org/jira/browse/SPARK-14477 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Mark Grover >Assignee: Apache Spark >Priority: Minor > > Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala > from. It makes sense to override these locations with mirrors in many cases, > so this change will add support for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn
Mark Grover created SPARK-14477: --- Summary: Allow custom mirrors for downloading artifacts in build/mvn Key: SPARK-14477 URL: https://issues.apache.org/jira/browse/SPARK-14477 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.0.0 Reporter: Mark Grover Priority: Minor Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala from. It makes sense to override these locations with mirrors in many cases, so this change will add support for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14270) whole stage codegen support for typed filter
[ https://issues.apache.org/jira/browse/SPARK-14270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14270. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12061 [https://github.com/apache/spark/pull/12061] > whole stage codegen support for typed filter > > > Key: SPARK-14270 > URL: https://issues.apache.org/jira/browse/SPARK-14270 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14415) All functions should show usages by command `DESC FUNCTION`
[ https://issues.apache.org/jira/browse/SPARK-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14415: -- Description: Currently, many functions do not show usages like the following. {code} scala> sql("desc function extended `sin`").collect().foreach(println) [Function: sin] [Class: org.apache.spark.sql.catalyst.expressions.Sin] [Usage: To be added.] [Extended Usage: To be added.] {code} This PR adds descriptions for functions and adds a test case to prevent adding a function without usage. {code} scala> sql("desc function extended `sin`").collect().foreach(println); [Function: sin] [Class: org.apache.spark.sql.catalyst.expressions.Sin] [Usage: sin(x) - Returns the sine of x.] [Extended Usage: > SELECT sin(0); 0.0] {code} The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`. was: For Spark SQL, this issue aims to show the following function (expression) description properly by adding `ExpressionDescription` annotation. *Functions* abs acos asin atan atan2 ascii base64 bin ceil ceiling concat concat_ws conv cos cosh decode degrees e encode exp expm1 hex hypot factorial find_in_set floor format_number format_string instr length levenshtein locate log log2 log10 log1p lpad ltrim pi pmod pow power radians repeat reverse round rpad rtrim shiftleft shiftright shiftrightunsigned signum sin sinh soundex sqrt substr substring substring_index tan tanh translate trim unbase64 unhex *Files* sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala *Before* {code} scala> sql("desc function extended `sin`").collect().foreach(println) [Function: sin] [Class: org.apache.spark.sql.catalyst.expressions.Sin] [Usage: To be added.] [Extended Usage: To be added.] 
{code} *After* {code} scala> sql("desc function extended `sin`").collect().foreach(println); [Function: sin] [Class: org.apache.spark.sql.catalyst.expressions.Sin] [Usage: sin(x) - Returns the sine of x.] [Extended Usage: > SELECT sin(0); 0.0] {code} Summary: All functions should show usages by command `DESC FUNCTION` (was: Add ExpressionDescription annotation for SQL expressions) > All functions should show usages by command `DESC FUNCTION` > --- > > Key: SPARK-14415 > URL: https://issues.apache.org/jira/browse/SPARK-14415 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > Currently, many functions do not show usages like the following. > {code} > scala> sql("desc function extended `sin`").collect().foreach(println) > [Function: sin] > [Class: org.apache.spark.sql.catalyst.expressions.Sin] > [Usage: To be added.] > [Extended Usage: > To be added.] > {code} > This PR adds descriptions for functions and adds a test case to prevent adding > a function without usage. > {code} > scala> sql("desc function extended `sin`").collect().foreach(println); > [Function: sin] > [Class: org.apache.spark.sql.catalyst.expressions.Sin] > [Usage: sin(x) - Returns the sine of x.] > [Extended Usage: > > SELECT sin(0); > 0.0] > {code} > The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231097#comment-15231097 ] Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:01 AM: Note on StandardScaler: MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std. * [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses biased sample std. R's [scale package | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std. I'm used to seeing the biased sample std used in ML, probably because it is handy for proofs to know columns have L2 norm 1. * [~mengxr] reports that glmnet uses the biased sample std. * *Q*: Should we change StandardScaler to use unbiased sample std? was (Author: josephkb): StandardScaler: This may be 2 confounded issues. MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std. * *Q*: [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses biased sample std. R's [scale package | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std. I'm used to seeing the biased sample std used in ML, probably because it is handy for proofs to know columns have L2 norm 1. My main question is: What does glmnet do? This is important since we compare with it for MLlib GLM unit tests. The difference might be insignificant, though, for GLMs and the datasets we are testing on. > Update RDD.treeAggregate not to use reduce > -- > > Key: SPARK-14408 > URL: https://issues.apache.org/jira/browse/SPARK-14408 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. 
Bradley >Priority: Minor > > **Issue** > In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and > {{combOp}} functions to modify and return their first argument, just like > {{RDD.aggregate}}. However, it is not documented that way. > I started to add docs to this effect, but then noticed that {{treeAggregate}} > uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which > technically allows the seq/combOps to modify and return their first arguments. > **Question**: Is the implementation safe, or does it need to be updated? > **Decision**: Avoid using reduce. Use fold instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
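For reference, the biased vs. unbiased distinction discussed above is just a choice of denominator. A minimal standalone sketch (plain Scala, toy data — not the StandardScaler implementation itself):

```scala
// Biased vs. unbiased sample standard deviation on a toy vector.
object StdDemo {
  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0)
    val n = xs.length
    val mean = xs.sum / n
    val ssq = xs.map(x => (x - mean) * (x - mean)).sum
    val biasedStd = math.sqrt(ssq / n)         // divide by n (sklearn's StandardScaler)
    val unbiasedStd = math.sqrt(ssq / (n - 1)) // divide by n - 1 (MLlib, R's scale())
    println(f"biased=$biasedStd%.4f unbiased=$unbiasedStd%.4f")
  }
}
```

The two agree asymptotically; on small columns (as in unit tests comparing against glmnet or sklearn) the difference is visible.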
[jira] [Issue Comment Deleted] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14408: -- Comment: was deleted (was: Hm, maybe this is just a bug in this PR, looking at IDF. De-escalating for now...) > Update RDD.treeAggregate not to use reduce > -- > > Key: SPARK-14408 > URL: https://issues.apache.org/jira/browse/SPARK-14408 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > **Issue** > In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and > {{combOp}} functions to modify and return their first argument, just like > {{RDD.aggregate}}. However, it is not documented that way. > I started to add docs to this effect, but then noticed that {{treeAggregate}} > uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which > technically allows the seq/combOps to modify and return their first arguments. > **Question**: Is the implementation safe, or does it need to be updated? > **Decision**: Avoid using reduce. Use fold instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230994#comment-15230994 ] Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:00 AM: After a bit of a scare (b/c of the confounding issue of StandardScaler not matching sklearn), it's definitely an issue with my initial PR to "fix" treeAggregate's implementation. That said, I'm still having a hard time figuring out the right way to fix the implementation. I'll comment more on the PR. was (Author: josephkb): Not meaning to cause panic here, but I'm escalating this since it might be a critical bug in MLlib. [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you please help me confirm that this is a bug? If you agree, then we can: * Change this to a blocker for 2.0 * Update all failing unit tests. ** I propose to do this in a single PR. It would be great to get help with fixing the unit tests via PRs sent to my PR. ** Alternatively, we could split up this work by creating a temporary {{private[spark] def brokenTreeAggregate}} method to be used for unit tests not yet ported to the fixed treeAggregate. But I'd prefer not to do this since we will want to backport the fix. * Backport to all reasonable versions. This will be painful because of unit tests. Currently, I'm testing StandardScaler and IDF. > Update RDD.treeAggregate not to use reduce > -- > > Key: SPARK-14408 > URL: https://issues.apache.org/jira/browse/SPARK-14408 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > **Issue** > In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and > {{combOp}} functions to modify and return their first argument, just like > {{RDD.aggregate}}. However, it is not documented that way. 
> I started to add docs to this effect, but then noticed that {{treeAggregate}} > uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which > technically allows the seq/combOps to modify and return their first arguments. > **Question**: Is the implementation safe, or does it need to be updated? > **Decision**: Avoid using reduce. Use fold instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14408: -- Priority: Minor (was: Major) > Update RDD.treeAggregate not to use reduce > -- > > Key: SPARK-14408 > URL: https://issues.apache.org/jira/browse/SPARK-14408 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > **Issue** > In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and > {{combOp}} functions to modify and return their first argument, just like > {{RDD.aggregate}}. However, it is not documented that way. > I started to add docs to this effect, but then noticed that {{treeAggregate}} > uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which > technically allows the seq/combOps to modify and return their first arguments. > **Question**: Is the implementation safe, or does it need to be updated? > **Decision**: Avoid using reduce. Use fold instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
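To make the fold-vs-reduce decision concrete: fold starts each combine from a caller-supplied zero value, so the ops are free to mutate and return that accumulator; reduce has no zero and would have to mutate an element of the data itself. A minimal local-collection sketch of the distinction (not Spark's actual treeAggregate code):

```scala
import scala.collection.mutable.ArrayBuffer

object FoldDemo {
  def main(args: Array[String]): Unit = {
    val data = Seq(ArrayBuffer(1), ArrayBuffer(2), ArrayBuffer(3))

    // fold-style: the accumulator is a fresh zero value, safe to mutate in place.
    val folded = data.foldLeft(ArrayBuffer.empty[Int]) { (acc, elem) =>
      acc ++= elem // mutate and return the first argument
      acc
    }

    // The inputs are untouched; a reduce-based combine that mutated its first
    // argument would have modified data.head instead.
    assert(data.head == ArrayBuffer(1))
    assert(folded == ArrayBuffer(1, 2, 3))
  }
}
```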
[jira] [Commented] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231354#comment-15231354 ] Davies Liu commented on SPARK-14476: cc [~lian cheng] > Show table name or path in string of DataSourceScan > --- > > Key: SPARK-14476 > URL: https://issues.apache.org/jira/browse/SPARK-14476 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, the string of DataSourceScan is only "HadoopFiles xxx", without > any information about the table name or path. > Since we have that in 1.6, this is kind of a regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14476) Show table name or path in string of DataSourceScan
Davies Liu created SPARK-14476: -- Summary: Show table name or path in string of DataSourceScan Key: SPARK-14476 URL: https://issues.apache.org/jira/browse/SPARK-14476 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Right now, the string of DataSourceScan is only "HadoopFiles xxx", without any information about the table name or path. Since we have that in 1.6, this is kind of a regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14475) Propagate user-defined context from driver to executors
[ https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14475: Assignee: (was: Apache Spark) > Propagate user-defined context from driver to executors > --- > > Key: SPARK-14475 > URL: https://issues.apache.org/jira/browse/SPARK-14475 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang > > It would be useful (e.g. for tracing) to automatically propagate arbitrary > user defined context (i.e. thread-locals) from the driver to executors. We > can do this easily by adding sc.localProperties to TaskContext. > cc [~joshrosen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14475) Propagate user-defined context from driver to executors
[ https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231353#comment-15231353 ] Apache Spark commented on SPARK-14475: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/12248 > Propagate user-defined context from driver to executors > --- > > Key: SPARK-14475 > URL: https://issues.apache.org/jira/browse/SPARK-14475 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang > > It would be useful (e.g. for tracing) to automatically propagate arbitrary > user defined context (i.e. thread-locals) from the driver to executors. We > can do this easily by adding sc.localProperties to TaskContext. > cc [~joshrosen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14475) Propagate user-defined context from driver to executors
[ https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14475: Assignee: Apache Spark > Propagate user-defined context from driver to executors > --- > > Key: SPARK-14475 > URL: https://issues.apache.org/jira/browse/SPARK-14475 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang >Assignee: Apache Spark > > It would be useful (e.g. for tracing) to automatically propagate arbitrary > user defined context (i.e. thread-locals) from the driver to executors. We > can do this easily by adding sc.localProperties to TaskContext. > cc [~joshrosen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14388) Create Table
[ https://issues.apache.org/jira/browse/SPARK-14388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-14388: - Assignee: Andrew Or > Create Table > > > Key: SPARK-14388 > URL: https://issues.apache.org/jira/browse/SPARK-14388 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > > For now, we still ask Hive to handle creating hive tables. We should handle > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14475) Propagate user-defined context from driver to executors
Eric Liang created SPARK-14475: -- Summary: Propagate user-defined context from driver to executors Key: SPARK-14475 URL: https://issues.apache.org/jira/browse/SPARK-14475 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Eric Liang It would be useful (e.g. for tracing) to automatically propagate arbitrary user defined context (i.e. thread-locals) from the driver to executors. We can do this easily by adding sc.localProperties to TaskContext. cc [~joshrosen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
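A sketch of what the proposed propagation would enable, assuming the API shape discussed in the ticket (setLocalProperty on the driver plus a matching getter exposed on TaskContext — the getter name is an assumption, not confirmed). Requires a running SparkContext, so this is illustrative only:

```scala
// Hypothetical sketch for SPARK-14475: a driver-side thread-local property
// becomes visible inside tasks. TaskContext.getLocalProperty is assumed here.
import org.apache.spark.{SparkContext, TaskContext}

object TracePropagationSketch {
  def traceExample(sc: SparkContext): Array[String] = {
    sc.setLocalProperty("trace.id", "abc-123") // thread-local on the driver
    sc.parallelize(1 to 4, numSlices = 2).map { _ =>
      // with the proposed change, the driver's local properties reach each task
      TaskContext.get().getLocalProperty("trace.id")
    }.collect()
  }
}
```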
[jira] [Issue Comment Deleted] (SPARK-14410) SessionCatalog needs to check function existence
[ https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-14410: -- Comment: was deleted (was: User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/12183) > SessionCatalog needs to check function existence > - > > Key: SPARK-14410 > URL: https://issues.apache.org/jira/browse/SPARK-14410 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > Fix For: 2.0.0 > > > Right now, operations on an existing function in SessionCatalog do not > really check if the function exists. We should add this check and avoid > doing the check in commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14474) Move FileSource offset log into checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14474: Assignee: Apache Spark (was: Shixiong Zhu) > Move FileSource offset log into checkpointLocation > -- > > Key: SPARK-14474 > URL: https://issues.apache.org/jira/browse/SPARK-14474 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Now that we have a single location for storing checkpointed state, propagate > this information into the source so that we don't have one random log off on > its own. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14474) Move FileSource offset log into checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231334#comment-15231334 ] Apache Spark commented on SPARK-14474: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12247 > Move FileSource offset log into checkpointLocation > -- > > Key: SPARK-14474 > URL: https://issues.apache.org/jira/browse/SPARK-14474 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Now that we have a single location for storing checkpointed state, propagate > this information into the source so that we don't have one random log off on > its own. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14474) Move FileSource offset log into checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14474: Assignee: Shixiong Zhu (was: Apache Spark) > Move FileSource offset log into checkpointLocation > -- > > Key: SPARK-14474 > URL: https://issues.apache.org/jira/browse/SPARK-14474 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Now that we have a single location for storing checkpointed state, propagate > this information into the source so that we don't have one random log off on > its own. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14474) Move FileSource offset log into checkpointLocation
Shixiong Zhu created SPARK-14474: Summary: Move FileSource offset log into checkpointLocation Key: SPARK-14474 URL: https://issues.apache.org/jira/browse/SPARK-14474 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Now that we have a single location for storing checkpointed state, propagate this information into the source so that we don't have one random log off on its own. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14473) Define analysis rules for operations not supported in streaming
[ https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231330#comment-15231330 ] Apache Spark commented on SPARK-14473: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/12246 > Define analysis rules for operations not supported in streaming > --- > > Key: SPARK-14473 > URL: https://issues.apache.org/jira/browse/SPARK-14473 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > There are many operations that are currently not supported in the streaming > execution. For example: > Some examples: > - joining two streams > - unioning a stream and a batch source > - sorting > - window functions (not time windows) > - distinct aggregates > Furthermore, executing a query with a stream source as a batch query should > also fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14473) Define analysis rules for operations not supported in streaming
[ https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14473: Assignee: Apache Spark (was: Tathagata Das) > Define analysis rules for operations not supported in streaming > --- > > Key: SPARK-14473 > URL: https://issues.apache.org/jira/browse/SPARK-14473 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > There are many operations that are currently not supported in the streaming > execution. For example: > Some examples: > - joining two streams > - unioning a stream and a batch source > - sorting > - window functions (not time windows) > - distinct aggregates > Furthermore, executing a query with a stream source as a batch query should > also fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14473) Define analysis rules for operations not supported in streaming
[ https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14473: Assignee: Tathagata Das (was: Apache Spark) > Define analysis rules for operations not supported in streaming > --- > > Key: SPARK-14473 > URL: https://issues.apache.org/jira/browse/SPARK-14473 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > There are many operations that are currently not supported in the streaming > execution. For example: > Some examples: > - joining two streams > - unioning a stream and a batch source > - sorting > - window functions (not time windows) > - distinct aggregates > Furthermore, executing a query with a stream source as a batch query should > also fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14473) Define analysis rules for operations not supported in streaming
[ https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-14473: -- Description: There are many operations that are currently not supported in the streaming execution. For example: - joining two streams - unioning a stream and a batch source - sorting - window functions (not time windows) - distinct aggregates Furthermore, executing a query with a stream source as a batch query should also fail. was: There are many operations that are currently not supported in the streaming execution. For example: Some examples: - joining two streams - unioning a stream and a batch source - sorting - window functions (not time windows) - distinct aggregates Furthermore, executing a query with a stream source as a batch query should also fail. > Define analysis rules for operations not supported in streaming > --- > > Key: SPARK-14473 > URL: https://issues.apache.org/jira/browse/SPARK-14473 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > There are many operations that are currently not supported in the streaming > execution. For example: > - joining two streams > - unioning a stream and a batch source > - sorting > - window functions (not time windows) > - distinct aggregates > Furthermore, executing a query with a stream source as a batch query should > also fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14410) SessionCatalog needs to check function existence
[ https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14410. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12198 [https://github.com/apache/spark/pull/12198] > SessionCatalog needs to check function existence > - > > Key: SPARK-14410 > URL: https://issues.apache.org/jira/browse/SPARK-14410 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > Fix For: 2.0.0 > > > Right now, operations on an existing function in SessionCatalog do not > really check if the function exists. We should add this check and avoid > doing the check in commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14414) Make error messages consistent across DDLs
[ https://issues.apache.org/jira/browse/SPARK-14414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231321#comment-15231321 ] Yin Huai commented on SPARK-14414: -- Let's also take care https://github.com/apache/spark/pull/12198#discussion_r58955840 with this PR. > Make error messages consistent across DDLs > -- > > Key: SPARK-14414 > URL: https://issues.apache.org/jira/browse/SPARK-14414 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > There are many different error messages right now when the user tries to run > something that's not supported. We might throw AnalysisException or > ParseException or NoSuchFunctionException etc. We should make all of these > consistent before 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14473) Define analysis rules for operations not supported in streaming
Tathagata Das created SPARK-14473: - Summary: Define analysis rules for operations not supported in streaming Key: SPARK-14473 URL: https://issues.apache.org/jira/browse/SPARK-14473 Project: Spark Issue Type: Sub-task Components: SQL, Streaming Reporter: Tathagata Das Assignee: Tathagata Das There are many operations that are currently not supported in the streaming execution. For example: Some examples: - joining two streams - unioning a stream and a batch source - sorting - window functions (not time windows) - distinct aggregates Furthermore, executing a query with a stream source as a batch query should also fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14472) Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable
[ https://issues.apache.org/jira/browse/SPARK-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231320#comment-15231320 ] Bryan Cutler commented on SPARK-14472: -- I'm working on it :D > Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from > JavaCallable > -- > > Key: SPARK-14472 > URL: https://issues.apache.org/jira/browse/SPARK-14472 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > > Currently, JavaCallable is used to wrap a plain Java object and act as a > mixin to JavaModel to provide a convenient method to make Java calls to an > object defined in JavaWrapper. The inheritance structure could be simplified > by defining the object in JavaCallable and using it as a base class for > JavaWrapper. Also, some renaming of these classes might better reflect their > purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14472) Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable
Bryan Cutler created SPARK-14472: Summary: Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable Key: SPARK-14472 URL: https://issues.apache.org/jira/browse/SPARK-14472 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Bryan Cutler Priority: Minor Currently, JavaCallable is used to wrap a plain Java object and act as a mixin to JavaModel to provide a convenient method to make Java calls to an object defined in JavaWrapper. The inheritance structure could be simplified by defining the object in JavaCallable and using it as a base class for JavaWrapper. Also, some renaming of these classes might better reflect their purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
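The simplification described above can be sketched in plain Python. The class names mirror PySpark's wrappers, but this is a hypothetical minimal version: the wrapped "Java object" is a Python stand-in, not a real py4j handle:

```python
# Hypothetical sketch of the simplified hierarchy: the wrapped object and the
# call helper live in a single base class, and JavaModel inherits from it
# instead of mixing in a separate JavaCallable.
class JavaWrapper:
    def __init__(self, java_obj=None):
        self._java_obj = java_obj

    def _call_java(self, name, *args):
        """Forward a method call to the wrapped (stand-in) Java object."""
        return getattr(self._java_obj, name)(*args)

class JavaModel(JavaWrapper):
    pass

class FakeJavaModel:
    # Stand-in for a py4j JavaObject, used only for this illustration.
    def numFeatures(self):
        return 3

model = JavaModel(FakeJavaModel())
print(model._call_java("numFeatures"))  # 3
```

With one base class holding `_java_obj`, every subclass gets `_call_java` without a mixin, which is the flattening the ticket proposes.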
[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency
[ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231308#comment-15231308 ] DB Tsai commented on SPARK-13944: - For production use cases, it's not desirable to include the whole Spark stack just to use the linear algebra library or even the models in Spark mllib; much of the time, those implementations can stand alone without depending on the Spark platform. Because mllib currently depends on the Spark platform, using it in production often causes jar conflicts, and people end up reimplementing the algorithms for production. The goal of this PR is only to separate the local linear algebra out of mllib and set up a build that provides an mllib-local jar. The long-term goal is to gradually move the platform-independent code out of mllib into mllib-local, so people can easily use it in their production apps. > Separate out local linear algebra as a standalone module without Spark > dependency > - > > Key: SPARK-13944 > URL: https://issues.apache.org/jira/browse/SPARK-13944 > Project: Spark > Issue Type: New Feature > Components: Build, ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: DB Tsai >Priority: Blocker > > Separate out linear algebra as a standalone module without Spark dependency > to simplify production deployment. We can call the new module > spark-mllib-local, which might contain local models in the future. > The major issue is to remove dependencies on user-defined types. > The package name will be changed from mllib to ml. For example, Vector will > be changed from `org.apache.spark.mllib.linalg.Vector` to > `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML > pipeline will be the one in the ml package; however, the existing mllib code will > not be touched. As a result, this will potentially break the API. Also, when > the vector is loaded from an mllib vector by Spark SQL, it will > automatically be converted into the one in the ml package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14471) The alias created in SELECT could be used in GROUP BY
Davies Liu created SPARK-14471: -- Summary: The alias created in SELECT could be used in GROUP BY Key: SPARK-14471 URL: https://issues.apache.org/jira/browse/SPARK-14471 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu This query should be able to run: {code} select a a1, count(1) from t group by a1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
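The requested behavior already exists in some SQL engines. As an illustration only (this uses SQLite via Python's stdlib, not Spark SQL), the alias defined in the SELECT list resolves in GROUP BY:

```python
import sqlite3

# Illustration (not Spark code): SQLite already resolves a SELECT-list alias
# when it is referenced in GROUP BY, which is the behavior SPARK-14471 asks
# Spark SQL to support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])

# The alias `a1` defined in the SELECT list is referenced in GROUP BY.
rows = conn.execute(
    "SELECT a AS a1, COUNT(1) FROM t GROUP BY a1 ORDER BY a1"
).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

The same query fails in engines that resolve GROUP BY strictly against the input relation, which is why this requires an explicit analyzer change rather than a parser change.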
[jira] [Assigned] (SPARK-14462) Add the mllib-local build to maven pom
[ https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14462: Assignee: DB Tsai (was: Apache Spark) > Add the mllib-local build to maven pom > -- > > Key: SPARK-14462 > URL: https://issues.apache.org/jira/browse/SPARK-14462 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Blocker > > In order to separate the linear algebra and vector/matrix classes into a > standalone jar, we need to set up the build first. This task will create a new > jar called mllib-local with minimal dependencies. The test scope will still > depend on spark-core and spark-core-test in order to use the common > utilities, but the runtime will avoid any platform dependency. A couple of > platform-independent classes will be moved to this package to demonstrate how > this works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14462) Add the mllib-local build to maven pom
[ https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231299#comment-15231299 ] Apache Spark commented on SPARK-14462: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/12241 > Add the mllib-local build to maven pom > -- > > Key: SPARK-14462 > URL: https://issues.apache.org/jira/browse/SPARK-14462 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Blocker > > In order to separate the linear algebra, and vector matrix classes into a > standalone jar, we need to setup the build first. This task will create a new > jar called mllib-local with minimal dependencies. The test scope will still > depend on spark-core and spark-core-test in order to use the common > utilities, but the runtime will avoid any platform dependency. Couple > platform independent classes will be moved to this package to demonstrate how > this work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14462) Add the mllib-local build to maven pom
[ https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14462: Assignee: Apache Spark (was: DB Tsai) > Add the mllib-local build to maven pom > -- > > Key: SPARK-14462 > URL: https://issues.apache.org/jira/browse/SPARK-14462 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Blocker > > In order to separate the linear algebra, and vector matrix classes into a > standalone jar, we need to setup the build first. This task will create a new > jar called mllib-local with minimal dependencies. The test scope will still > depend on spark-core and spark-core-test in order to use the common > utilities, but the runtime will avoid any platform dependency. Couple > platform independent classes will be moved to this package to demonstrate how > this work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14469) Remove mllib-local from mima project exclusion
[ https://issues.apache.org/jira/browse/SPARK-14469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231283#comment-15231283 ] Sean Owen commented on SPARK-14469: --- This seems like a duplicate or sub-task of your other JIRAs for mllib-local. Let's combine them? I don't see how this is a stand-alone task. > Remove mllib-local from mima project exclusion > -- > > Key: SPARK-14469 > URL: https://issues.apache.org/jira/browse/SPARK-14469 > Project: Spark > Issue Type: Task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: DB Tsai >Assignee: DB Tsai > > We need to remove the exclude once 2.0 has been published and there is a > previous artifact for MiMa to compare against. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency
[ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231276#comment-15231276 ] Sean Owen commented on SPARK-13944: --- I'm still not clear on the purpose of this change. I don't think Spark has a goal of providing local, non-distributed, non-Spark-based ML implementations. I can imagine providing a module of API classes only, but, that also does not seem to be the purpose here. What is in this "local" module and why? > Separate out local linear algebra as a standalone module without Spark > dependency > - > > Key: SPARK-13944 > URL: https://issues.apache.org/jira/browse/SPARK-13944 > Project: Spark > Issue Type: New Feature > Components: Build, ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: DB Tsai >Priority: Blocker > > Separate out linear algebra as a standalone module without Spark dependency > to simplify production deployment. We can call the new module > spark-mllib-local, which might contain local models in the future. > The major issue is to remove dependencies on user-defined types. > The package name will be changed from mllib to ml. For example, Vector will > be changed from `org.apache.spark.mllib.linalg.Vector` to > `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML > pipeline will be the one in ML package; however, the existing mllib code will > not be touched. As a result, this will potentially break the API. Also, when > the vector is loaded from mllib vector by Spark SQL, the vector will > automatically converted into the one in ml package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231268#comment-15231268 ] Xiao Li commented on SPARK-14127: - We will enable it in SQLContext. Move it to sql/core. > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14470) Allow for overriding both httpclient and httpcore versions
[ https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231255#comment-15231255 ] Aaron Tokhy commented on SPARK-14470: - Patch cleanly applies to both branch-1.6 and master > Allow for overriding both httpclient and httpcore versions > -- > > Key: SPARK-14470 > URL: https://issues.apache.org/jira/browse/SPARK-14470 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Aaron Tokhy >Priority: Minor > > The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and > 'httpcore' versions are the same. This restriction isn't necessarily true, > as you could potentially have an httpclient version of 4.3.6 and an httpcore > version of 4.3.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14470) Allow for overriding both httpclient and httpcore versions
[ https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14470: Assignee: (was: Apache Spark) > Allow for overriding both httpclient and httpcore versions > -- > > Key: SPARK-14470 > URL: https://issues.apache.org/jira/browse/SPARK-14470 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Aaron Tokhy >Priority: Minor > > The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and > 'httpcore' versions are the same. This restriction isn't necessarily true, > as you could potentially have an httpclient version of 4.3.6 and an httpcore > version of 4.3.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14470) Allow for overriding both httpclient and httpcore versions
[ https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14470: Assignee: Apache Spark > Allow for overriding both httpclient and httpcore versions > -- > > Key: SPARK-14470 > URL: https://issues.apache.org/jira/browse/SPARK-14470 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Aaron Tokhy >Assignee: Apache Spark >Priority: Minor > > The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and > 'httpcore' versions are the same. This restriction isn't necessarily true, > as you could potentially have an httpclient version of 4.3.6 and an httpcore > version of 4.3.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14470) Allow for overriding both httpclient and httpcore versions
[ https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231252#comment-15231252 ] Apache Spark commented on SPARK-14470: -- User 'atokhy' has created a pull request for this issue: https://github.com/apache/spark/pull/12245 > Allow for overriding both httpclient and httpcore versions > -- > > Key: SPARK-14470 > URL: https://issues.apache.org/jira/browse/SPARK-14470 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Aaron Tokhy >Priority: Minor > > The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and > 'httpcore' versions are the same. This restriction isn't necessarily true, > as you could potentially have an httpclient version of 4.3.6 and an httpcore > version of 4.3.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14468) Always enable OutputCommitCoordinator
[ https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231235#comment-15231235 ] Apache Spark commented on SPARK-14468: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12244 > Always enable OutputCommitCoordinator > - > > Key: SPARK-14468 > URL: https://issues.apache.org/jira/browse/SPARK-14468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > The OutputCommitCoordinator was originally introduced in SPARK-4879 because > speculation causes the output of some partitions to be deleted. However, as > we can see in SPARK-10063, speculation is not the only case where this can > happen. > More specifically, when we retry a stage we're not guaranteed to kill the > tasks that are still running (we don't even interrupt their threads), so we > may end up with multiple concurrent task attempts for the same task. This > leads to problems like SPARK-8029, but this fix alone is necessary but not > sufficient. > In general, when we run into situations like these, we need the > OutputCommitCoordinator because we don't control what the underlying file > system does. Enabling this doesn't induce heavy performance costs so there's > little reason why we shouldn't always enable it to ensure correctness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14468) Always enable OutputCommitCoordinator
[ https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14468: Assignee: Andrew Or (was: Apache Spark) > Always enable OutputCommitCoordinator > - > > Key: SPARK-14468 > URL: https://issues.apache.org/jira/browse/SPARK-14468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > The OutputCommitCoordinator was originally introduced in SPARK-4879 because > speculation causes the output of some partitions to be deleted. However, as > we can see in SPARK-10063, speculation is not the only case where this can > happen. > More specifically, when we retry a stage we're not guaranteed to kill the > tasks that are still running (we don't even interrupt their threads), so we > may end up with multiple concurrent task attempts for the same task. This > leads to problems like SPARK-8029, but this fix alone is necessary but not > sufficient. > In general, when we run into situations like these, we need the > OutputCommitCoordinator because we don't control what the underlying file > system does. Enabling this doesn't induce heavy performance costs so there's > little reason why we shouldn't always enable it to ensure correctness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14468) Always enable OutputCommitCoordinator
[ https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14468: Assignee: Apache Spark (was: Andrew Or) > Always enable OutputCommitCoordinator > - > > Key: SPARK-14468 > URL: https://issues.apache.org/jira/browse/SPARK-14468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Andrew Or >Assignee: Apache Spark > > The OutputCommitCoordinator was originally introduced in SPARK-4879 because > speculation causes the output of some partitions to be deleted. However, as > we can see in SPARK-10063, speculation is not the only case where this can > happen. > More specifically, when we retry a stage we're not guaranteed to kill the > tasks that are still running (we don't even interrupt their threads), so we > may end up with multiple concurrent task attempts for the same task. This > leads to problems like SPARK-8029, but this fix alone is necessary but not > sufficient. > In general, when we run into situations like these, we need the > OutputCommitCoordinator because we don't control what the underlying file > system does. Enabling this doesn't induce heavy performance costs so there's > little reason why we shouldn't always enable it to ensure correctness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14470) Allow for overriding both httpclient and httpcore versions
Aaron Tokhy created SPARK-14470: --- Summary: Allow for overriding both httpclient and httpcore versions Key: SPARK-14470 URL: https://issues.apache.org/jira/browse/SPARK-14470 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1, 2.0.0 Reporter: Aaron Tokhy Priority: Minor The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 'httpcore' versions are the same. This restriction isn't necessarily true, as you could potentially have an httpclient version of 4.3.6 and an httpcore version of 4.3.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
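The decoupling described can be sketched as a pom fragment. This is a hedged illustration, not the actual patch: the property names and version numbers below are hypothetical, and only the `org.apache.httpcomponents` coordinates are taken as given:

```xml
<!-- Sketch of the proposed change (property names illustrative): give the
     two httpcomponents artifacts independent version properties instead of
     sharing a single one. -->
<properties>
  <commons.httpclient.version>4.3.6</commons.httpclient.version>
  <commons.httpcore.version>4.3.3</commons.httpcore.version>
</properties>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>${commons.httpclient.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>${commons.httpcore.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

With two properties, a downstream build can override either artifact independently via `-Dcommons.httpcore.version=...` without forcing the other to match.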
[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231210#comment-15231210 ] holdenk commented on SPARK-13842: - So the testing framework is a mixture of doctests along with the standard unittest2 stuff (in the tests.py file in each subdirectory). Let me know if you have any questions while you're doing this that I can help with :) > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14469) Remove mllib-local from mima project exclusion
DB Tsai created SPARK-14469: --- Summary: Remove mllib-local from mima project exclusion Key: SPARK-14469 URL: https://issues.apache.org/jira/browse/SPARK-14469 Project: Spark Issue Type: Task Components: ML, MLlib Affects Versions: 2.0.0 Reporter: DB Tsai We need to remove the exclude once 2.0 has been published and there is a previous artifact for MiMa to compare against. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14469) Remove mllib-local from mima project exclusion
[ https://issues.apache.org/jira/browse/SPARK-14469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-14469: --- Assignee: DB Tsai > Remove mllib-local from mima project exclusion > -- > > Key: SPARK-14469 > URL: https://issues.apache.org/jira/browse/SPARK-14469 > Project: Spark > Issue Type: Task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: DB Tsai >Assignee: DB Tsai > > We need to remove the exclude once 2.0 has been published and there is a > previous artifact for MiMa to compare against. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
[ https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231201#comment-15231201 ] Shea Parkes commented on SPARK-13842: - I'm willing to give it a first pass. Need to go dig up what Python testing framework you guys are using, but that shouldn't be too hard. Unless anyone objects, I'd like to move StructType.names and StructType._needSerializeAnyField to properties at the same time. Should be a seamless refactor and cut down on the likelihood of future errors. Might even get to it tonight. Thanks! > Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType > -- > > Key: SPARK-13842 > URL: https://issues.apache.org/jira/browse/SPARK-13842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Shea Parkes >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to > {{pyspark.sql.types.StructType}}. Here are some simplistic suggestions: > {code} > def __iter__(self): > """Iterate the fields upon request.""" > return iter(self.fields) > def __getitem__(self, key): > """Return the corresponding StructField""" > _fields_dict = dict(zip(self.names, self.fields)) > try: > return _fields_dict[key] > except KeyError: > raise KeyError('No field named {}'.format(key)) > {code} > I realize the latter might be a touch more controversial since there could be > name collisions. Still, I doubt there are that many in practice and it would > be quite nice to work with. > Privately, I have more extensive metadata extraction methods overlaid on this > class, but I imagine the rest of what I have done might go too far for the > common user. If this request gains traction though, I'll share those other > layers. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
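The dunder methods proposed in SPARK-13842 can be tried outside Spark with a minimal stand-in. The `StructField`/`StructType` classes below are simplified hypothetical versions of the PySpark classes, not imports of them:

```python
# Stand-in for pyspark.sql.types (illustration only, nothing imports PySpark).
class StructField:
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType

class StructType:
    def __init__(self, fields):
        self.fields = fields
        self.names = [f.name for f in fields]

    def __iter__(self):
        """Iterate over the fields upon request."""
        return iter(self.fields)

    def __getitem__(self, key):
        """Return the StructField with the given name."""
        _fields_dict = dict(zip(self.names, self.fields))
        try:
            return _fields_dict[key]
        except KeyError:
            raise KeyError('No field named {}'.format(key))

schema = StructType([StructField("id", "long"), StructField("name", "string")])
print([f.name for f in schema])  # ['id', 'name']
print(schema["name"].dataType)   # string
```

The `__getitem__` lookup silently keeps only the last field when two fields share a name, which is the collision concern raised in the ticket.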
[jira] [Created] (SPARK-14468) Always enable OutputCommitCoordinator
Andrew Or created SPARK-14468: - Summary: Always enable OutputCommitCoordinator Key: SPARK-14468 URL: https://issues.apache.org/jira/browse/SPARK-14468 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or The OutputCommitCoordinator was originally introduced in SPARK-4879 because speculation causes the output of some partitions to be deleted. However, as we can see in SPARK-10063, speculation is not the only case where this can happen. More specifically, when we retry a stage we're not guaranteed to kill the tasks that are still running (we don't even interrupt their threads), so we may end up with multiple concurrent task attempts for the same task. This leads to problems like SPARK-8029, whose fix is necessary but not sufficient on its own. In general, when we run into situations like these, we need the OutputCommitCoordinator because we don't control what the underlying file system does. Enabling it doesn't incur heavy performance costs, so there's little reason not to always enable it to ensure correctness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
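The arbitration an output commit coordinator provides can be sketched in a few lines. This is a hypothetical first-committer-wins illustration, not Spark's implementation: for each (stage, partition), only the first task attempt that asks is authorized to commit:

```python
import threading

# Minimal sketch (not Spark's code) of first-committer-wins commit
# arbitration: concurrent attempts for the same partition race, and only
# the first to ask is allowed to commit its output.
class OutputCommitCoordinator:
    def __init__(self):
        self._lock = threading.Lock()
        self._authorized = {}  # (stage, partition) -> winning attempt number

    def can_commit(self, stage, partition, attempt):
        with self._lock:
            winner = self._authorized.setdefault((stage, partition), attempt)
            return winner == attempt

coord = OutputCommitCoordinator()
print(coord.can_commit(0, 0, attempt=1))  # True  (first asker wins)
print(coord.can_commit(0, 0, attempt=2))  # False (concurrent retry is denied)
print(coord.can_commit(0, 1, attempt=2))  # True  (different partition)
```

Because the decision is centralized and locked, two concurrent attempts cannot both believe they own the output file, regardless of what the underlying file system does.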
[jira] [Assigned] (SPARK-14467) Add async io in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14467: Assignee: (was: Apache Spark) > Add async io in FileScanRDD > --- > > Key: SPARK-14467 > URL: https://issues.apache.org/jira/browse/SPARK-14467 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li > > Experiments running over parquet data in S3 show poor interleaving of CPU > and IO. We should do more async IO in FileScanRDD to make better use of the machine > resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14467) Add async io in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231155#comment-15231155 ] Apache Spark commented on SPARK-14467: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/12243 > Add async io in FileScanRDD > --- > > Key: SPARK-14467 > URL: https://issues.apache.org/jira/browse/SPARK-14467 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li > > Experiments running over parquet data in S3 show poor interleaving of CPU > and IO. We should do more async IO in FileScanRDD to make better use of the machine > resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
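The overlap of IO and CPU that SPARK-14467 asks for can be sketched with a one-thread prefetcher. This is an assumption-laden illustration, not FileScanRDD's code: `read_file` stands in for a blocking read, and `.upper()` stands in for CPU-bound parsing:

```python
from concurrent.futures import ThreadPoolExecutor

def read_file(name):
    # Stand-in for a blocking read of one file's bytes (e.g. from S3).
    return "bytes of " + name

def scan(files):
    """Parse each file while the next one is being fetched in the background."""
    if not files:
        return []
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_file, files[0])  # start the first read
        for i in range(len(files)):
            data = future.result()                 # wait for current bytes
            if i + 1 < len(files):
                future = pool.submit(read_file, files[i + 1])  # prefetch next
            out.append(data.upper())               # "parse" = CPU work
    return out

print(scan(["a.parquet", "b.parquet"]))
```

While the loop body "parses" file i, the pool thread is already fetching file i+1, so the scan is bounded by max(IO, CPU) per file rather than their sum.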