[jira] [Commented] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231676#comment-15231676
 ] 

Reynold Xin commented on SPARK-14480:
-

Please go ahead!


> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not
> line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I
> think it was made like this for better performance. However, it looks like
> there are two problems.
> Firstly, it is actually not faster than processing line by line with an
> {{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
> in a {{Reader}}.
> Secondly, it brought a bit of complexity because additional logic is needed
> to allow every line to be read byte by byte. So it was pretty difficult to
> figure out parsing issues (e.g. SPARK-14103). Actually, almost all of the
> code in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail:
> h4. Method
> - The TPC-H lineitem table is tested.
> - Only 100 rows are collected, due to the lack of resources.
> - End-to-end tests and parsing-time tests are each performed 10 times and the
> averages are calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14480:
-
Description: 
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}
||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}
||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated (a rough sketch of the averaging is included under
Test Codes below).

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}
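
- Averaging the runs: a hypothetical sketch of how the 10-run average could be
scripted (this is not the code that produced the numbers above)
{code}
// Runs the given block `runs` times and returns the average elapsed nanoseconds.
def averageNanos(runs: Int)(run: => Unit): Double = {
  val samples = (1 to runs).map { _ =>
    val start = System.nanoTime
    run
    System.nanoTime - start
  }
  samples.sum.toDouble / runs
}

// e.g. averageNanos(10) { df.take(100) }, with `df` as defined in the
// end-to-end test below.
{code}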

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}
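
- To illustrate the kind of buffering logic the wrapping approach needs, a
hypothetical minimal {{Reader}} over an {{Iterator[String]}} (only a sketch,
not the actual {{StringIteratorReader}})
{code}
import java.io.Reader

class LinesReader(lines: Iterator[String]) extends Reader {
  private var buffer: String = ""
  private var pos: Int = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (pos >= buffer.length) {
      if (!lines.hasNext) return -1
      // Re-append the line separator the Iterator stripped away, so the CSV
      // parser can still detect record boundaries.
      buffer = lines.next() + "\n"
      pos = 0
    }
    val n = math.min(len, buffer.length - pos)
    buffer.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}
{code}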


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}
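
- For reference, a minimal illustrative sketch contrasting the two approaches,
assuming univocity's {{CsvParser}} (this is not the actual Spark code, and
{{lines}} is a hypothetical input)
{code}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.getFormat.setDelimiter('|')
val parser = new CsvParser(settings)

// Hypothetical input: an Iterator[String] of CSV lines.
val lines: Iterator[String] = Iterator("1|foo|2.0", "2|bar|3.5")

// Current approach (roughly): wrap the Iterator in a Reader and let the parser
// pull characters through it:
//   parser.beginParsing(new StringIteratorReader(lines))
//   var row = parser.parseNext()
//   while (row != null) { /* convert row */ row = parser.parseNext() }

// Proposed approach: feed the parser one line at a time.
val rows: Iterator[Array[String]] = lines.map(parser.parseLine)
rows.foreach(r => println(r.mkString(", ")))
{code}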


  was:
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}
||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}
||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}



> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>

[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14480:
-
Description: 
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}
||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}
||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}


  was:
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}



> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
> 

[jira] [Commented] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231674#comment-15231674
 ] 

Hyukjin Kwon commented on SPARK-14480:
--

[~rxin] [~srowen] Could I maybe try to open a PR for this first? I think the
code would give a clearer view.

> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not
> line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I
> think it was made like this for better performance. However, it looks like
> there are two problems.
> Firstly, it is actually not faster than processing line by line with an
> {{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
> in a {{Reader}}.
> Secondly, it brought a bit of complexity because additional logic is needed
> to allow every line to be read byte by byte. So it was pretty difficult to
> figure out parsing issues (e.g. SPARK-14103). Actually, almost all of the
> code in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail:
> h4. Method
> - The TPC-H lineitem table is tested.
> - Only 100 rows are collected, due to the lack of resources.
> - End-to-end tests and parsing-time tests are each performed 10 times and the
> averages are calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14480:
-
Description: 
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}


  was:
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}



> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
>

[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14480:
-
Description: 
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for {{Iterator}}
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}


  was:
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(filteredIter.next()))
...
{code}



> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source 

[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14480:
-
Description: 
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for an Iterator.
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}


  was:
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
// `reader` is a wrapper for {{Iterator}}
private val reader = new StringIteratorReader(iter)
parser.beginParsing(reader)
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(iter.next()))
...
{code}



> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
> 

[jira] [Updated] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14480:
-
Description: 
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB

h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test
{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})
{code}
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})
{code}
...
time(parser.parseLine(filteredIter.next()))
...
{code}


  was:
Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB


h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test

{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})

{code}
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})

{code}
...
time(parser.parseLine(filteredIter.next()))
...
{code}



> Simplify CSV parsing process with a better performance 
> ---
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not
> line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} 

[jira] [Created] (SPARK-14480) Simplify CSV parsing process with a better performance

2016-04-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-14480:


 Summary: Simplify CSV parsing process with a better performance 
 Key: SPARK-14480
 URL: https://issues.apache.org/jira/browse/SPARK-14480
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon


Currently, the CSV data source reads and parses CSV data byte by byte (not line
by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
it was made like this for better performance. However, it looks like there are
two problems.

Firstly, it is actually not faster than processing line by line with an
{{Iterator}}, because of the additional logic needed to wrap the {{Iterator}}
in a {{Reader}}.

Secondly, it brought a bit of complexity because additional logic is needed to
allow every line to be read byte by byte. So it was pretty difficult to figure
out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in
{{CSVParser}} might not be needed.

I made a rough patch and tested this. The test results for the first problem 
are below:

h4. Results

- Original codes with {{Reader}} wrapping {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New codes with {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H lineitem table is tested.
- Only 100 rows are collected, due to the lack of resources.
- End-to-end tests and parsing-time tests are each performed 10 times and the
averages are calculated.

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB


h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1
([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Codes

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
{code}

- End-to-end test

{code}
val path = "lineitem.tbl"
val df = sqlContext
  .read
  .format("csv")
  .option("header", "false")
  .option("delimiter", "|")
  .load(path)
time(df.take(100))
{code}

- Parsing time test for original (in {{BulkCsvParser}})

{code}
...
time(parser.parseNext())
...
{code}


- Parsing time test for new (in {{BulkCsvParser}})

{code}
...
time(parser.parseLine(filteredIter.next()))
...
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14375) Unit test for spark.ml KMeansSummary

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14375:


Assignee: Apache Spark

> Unit test for spark.ml KMeansSummary
> 
>
> Key: SPARK-14375
> URL: https://issues.apache.org/jira/browse/SPARK-14375
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> There is no unit test for KMeansSummary in spark.ml.
> Other items which could be fixed here:
> * Add Since version to KMeansSummary class
> * Modify clusterSizes method to match GMM method, to be robust to empty 
> clusters (in case we support that sometime)  (See PR for [SPARK-13538])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14375) Unit test for spark.ml KMeansSummary

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231664#comment-15231664
 ] 

Apache Spark commented on SPARK-14375:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12254

> Unit test for spark.ml KMeansSummary
> 
>
> Key: SPARK-14375
> URL: https://issues.apache.org/jira/browse/SPARK-14375
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> There is no unit test for KMeansSummary in spark.ml.
> Other items which could be fixed here:
> * Add Since version to KMeansSummary class
> * Modify clusterSizes method to match GMM method, to be robust to empty 
> clusters (in case we support that sometime)  (See PR for [SPARK-13538])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14375) Unit test for spark.ml KMeansSummary

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14375:


Assignee: (was: Apache Spark)

> Unit test for spark.ml KMeansSummary
> 
>
> Key: SPARK-14375
> URL: https://issues.apache.org/jira/browse/SPARK-14375
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> There is no unit test for KMeansSummary in spark.ml.
> Other items which could be fixed here:
> * Add Since version to KMeansSummary class
> * Modify clusterSizes method to match GMM method, to be robust to empty 
> clusters (in case we support that sometime)  (See PR for [SPARK-13538])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-07 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231655#comment-15231655
 ] 

Yan commented on SPARK-14389:
-

Actually, the current master branch does not have the issue, while 1.6.0 does.
There appear to have been improvements to the BNL join since 1.6, SPARK-13213
in particular.

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, 
> sample_script.py, stdout.txt
>
>
> When executing the attached sample_script.py in client mode with a single
> executor, an exception occurs, "java.lang.OutOfMemoryError: Java heap space",
> during the self join of a small table (TPC-H lineitem generated for a 1M
> dataset). Also see the execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14461) GLM training summaries should provide solver

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14461:


Assignee: Apache Spark

> GLM training summaries should provide solver
> 
>
> Key: SPARK-14461
> URL: https://issues.apache.org/jira/browse/SPARK-14461
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> GLM training summaries have different types of metrics available depending on 
> the solver used during training.  In the summaries, we should provide the 
> solver used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14461) GLM training summaries should provide solver

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14461:


Assignee: (was: Apache Spark)

> GLM training summaries should provide solver
> 
>
> Key: SPARK-14461
> URL: https://issues.apache.org/jira/browse/SPARK-14461
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GLM training summaries have different types of metrics available depending on 
> the solver used during training.  In the summaries, we should provide the 
> solver used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14461) GLM training summaries should provide solver

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231618#comment-15231618
 ] 

Apache Spark commented on SPARK-14461:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12253

> GLM training summaries should provide solver
> 
>
> Key: SPARK-14461
> URL: https://issues.apache.org/jira/browse/SPARK-14461
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GLM training summaries have different types of metrics available depending on 
> the solver used during training.  In the summaries, we should provide the 
> solver used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14127) [Table related commands] Describe table

2016-04-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231616#comment-15231616
 ] 

Xiao Li commented on SPARK-14127:
-


{noformat}
# Partition Information  
# col_name  data_type   comment  
{noformat}

These will be two rows; there will not be any empty rows.


> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?

2016-04-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14479:

Component/s: SparkR

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R's glm and glmnet, the default type of predict is "link", which is the
> linear predictor; users can specify "type = response" to output the response
> prediction. Currently the ML glm predict outputs the "response" prediction by
> default, which I think is more reasonable. Should we change the default type
> of the ML glm predict output?
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?

2016-04-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14479:

Description: 
In R's glm and glmnet, the default type of predict is "link", which is the
linear predictor; users can specify "type = response" to output the response
prediction. Currently the ML glm predict outputs the "response" prediction by
default, which I think is more reasonable. Should we change the default type
of the ML glm predict output?
R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet

Meanwhile, we should decide the default type of glm predict output in SparkR.
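
For example, for a binomial GLM with the logit link, the "link"-type prediction
is the linear predictor eta = X * beta (the log-odds), while the "response"-type
prediction is the probability mu = 1 / (1 + exp(-eta)).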

  was:
In R's glm and glmnet, the default type of predict is "link", which is the
linear predictor; users can specify "type = response" to output the response
prediction. Currently the ML glm predict outputs the "response" prediction by
default, which I think is more reasonable. Should we change the default type
of the ML glm predict output?
R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet


> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Reporter: Yanbo Liang
>
> In R's glm and glmnet, the default type of predict is "link", which is the
> linear predictor; users can specify "type = response" to output the response
> prediction. Currently the ML glm predict outputs the "response" prediction by
> default, which I think is more reasonable. Should we change the default type
> of the ML glm predict output?
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14479) GLM predict type should be link or response?

2016-04-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231604#comment-15231604
 ] 

Yanbo Liang commented on SPARK-14479:
-

This will introduce a breaking change, so it's better to make a decision before Spark 2.0.
cc [~mengxr] [~josephkb]

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Reporter: Yanbo Liang
>
> In R's glm and glmnet, the default type of predict is "link", which is the
> linear predictor; users can specify "type = response" to output the response
> prediction. Currently the ML glm predict outputs the "response" prediction by
> default, which I think is more reasonable. Should we change the default type
> of the ML glm predict output?
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?

2016-04-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14479:

Description: 
In R's glm and glmnet, the default type of predict is "link", which is the
linear predictor; users can specify "type = response" to output the response
prediction. Currently the ML glm predict outputs the "response" prediction by
default, which I think is more reasonable. Should we change the default type
of the ML glm predict output?
R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet

  was:In R's glm and glmnet, the default type of predict is "link", which is
the linear predictor; users can specify "type = response" to output the
response prediction. Currently the ML glm predict outputs the "response"
prediction by default, which I think is more reasonable. Should we change the
default type of the ML glm predict output?


> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Reporter: Yanbo Liang
>
> In R's glm and glmnet, the default type of predict is "link", which is the
> linear predictor; users can specify "type = response" to output the response
> prediction. Currently the ML glm predict outputs the "response" prediction by
> default, which I think is more reasonable. Should we change the default type
> of the ML glm predict output?
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14479) GLM predict type should be link or response?

2016-04-07 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-14479:
---

 Summary: GLM predict type should be link or response?
 Key: SPARK-14479
 URL: https://issues.apache.org/jira/browse/SPARK-14479
 Project: Spark
  Issue Type: Question
  Components: ML
Reporter: Yanbo Liang


In R's glm and glmnet, the default type of predict is "link", which is the
linear predictor; users can specify "type = response" to output the response
prediction. Currently the ML glm predict outputs the "response" prediction by
default, which I think is more reasonable. Should we change the default type
of the ML glm predict output?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14127) [Table related commands] Describe table

2016-04-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231600#comment-15231600
 ] 

Xiao Li commented on SPARK-14127:
-

{noformat}
hive> create table ptestfilter (a string, b int) partitioned by (c string, d 
string);
OK
Time taken: 1.464 seconds
hive> 
> describe ptestfilter;
OK
a   string  
b   int 
c   string  
d   string  
 
# Partition Information  
# col_name  data_type   comment 
 
c   string  
d   string  
Time taken: 0.449 seconds, Fetched: 10 row(s)
{noformat}


> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2016-04-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231585#comment-15231585
 ] 

Davies Liu commented on SPARK-8632:
---

[~bijay697] Python UDFs have been improved a lot recently in master, see 
https://issues.apache.org/jira/browse/SPARK-14267 and 
https://issues.apache.org/jira/browse/SPARK-14215.

Could you try master?

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data: one to give to the PythonRDD, then one to join the 
> Python lambda results with the original row (which may have Java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the case I was working with, I had a 500-column 
> table, and I wanted to use a Python UDF for one column, and it ended up caching 
> all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-04-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231579#comment-15231579
 ] 

Yanbo Liang commented on SPARK-14378:
-

I can work on it.

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231570#comment-15231570
 ] 

Apache Spark commented on SPARK-14460:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12252

> DataFrameWriter JDBC doesn't Quote/Escape column names
> --
>
> Key: SPARK-14460
> URL: https://issues.apache.org/jira/browse/SPARK-14460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Sean Rose
>  Labels: easyfix
>
> When I try to write a DataFrame which contains a column with a space in it 
> ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect 
> syntax near 'Address'
> I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping 
> column names. JdbcDialect has the "quoteIdentifier" method, which could be 
> called.
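
A minimal sketch of the suggested direction (plain Scala, not Spark's actual 
{{JdbcUtils}} code; the quoting function stands in for {{JdbcDialect.quoteIdentifier}}):
{code}
// Build an INSERT statement with every column name passed through a quoting function.
def insertStatement(table: String, columns: Seq[String], quote: String => String): String = {
  val cols = columns.map(quote).mkString(", ")
  val placeholders = columns.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($cols) VALUES ($placeholders)"
}

// Hypothetical SQL Server-style quoting; a real dialect would supply its own rule.
val stmt = insertStatement("patients", Seq("Patient Address", "Patient Name"), c => s"[$c]")
// INSERT INTO patients ([Patient Address], [Patient Name]) VALUES (?, ?)
{code}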



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14460:


Assignee: Apache Spark

> DataFrameWriter JDBC doesn't Quote/Escape column names
> --
>
> Key: SPARK-14460
> URL: https://issues.apache.org/jira/browse/SPARK-14460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Sean Rose
>Assignee: Apache Spark
>  Labels: easyfix
>
> When I try to write a DataFrame which contains a column with a space in it 
> ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect 
> syntax near 'Address'
> I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping 
> column names. JdbcDialect has the "quoteIdentifier" method, which could be 
> called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14460:


Assignee: (was: Apache Spark)

> DataFrameWriter JDBC doesn't Quote/Escape column names
> --
>
> Key: SPARK-14460
> URL: https://issues.apache.org/jira/browse/SPARK-14460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Sean Rose
>  Labels: easyfix
>
> When I try to write a DataFrame which contains a column with a space in it 
> ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect 
> syntax near 'Address'
> I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping 
> column names. JdbcDialect has the "quoteIdentifier" method, which could be 
> called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14460) DataFrameWriter JDBC doesn't Quote/Escape column names

2016-04-07 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231077#comment-15231077
 ] 

Bo Meng edited comment on SPARK-14460 at 4/8/16 3:04 AM:
-

Thanks [~srose03] for finding the root cause - that makes the fix easier. I will 
post the fix shortly.


was (Author: bomeng):
I can take a look. Thanks.

> DataFrameWriter JDBC doesn't Quote/Escape column names
> --
>
> Key: SPARK-14460
> URL: https://issues.apache.org/jira/browse/SPARK-14460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Sean Rose
>  Labels: easyfix
>
> When I try to write a DataFrame which contains a column with a space in it 
> ("Patient Address"), I get an error: java.sql.BatchUpdateException: Incorrect 
> syntax near 'Address'
> I believe the issue is that JdbcUtils.insertStatement isn't quoting/escaping 
> column names. JdbcDialect has the "quoteIdentifier" method, which could be 
> called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14403) the DAG of a stage may have too many identical child clusters, resulting in GC

2016-04-07 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula closed SPARK-14403.

Resolution: Resolved

> the DAG of a stage may have too many identical child clusters, resulting in GC
> ---
>
> Key: SPARK-14403
> URL: https://issues.apache.org/jira/browse/SPARK-14403
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: meiyoula
>
> When I run a SQL query, I can't open the stage page in the web UI, and the 
> history server process shuts down. 
> After debugging the code, I found that a stage graph had more than 5000 identical 
> child clusters, so when the dot file is generated, the process goes down.
> I think a graph cluster shouldn't contain the same child cluster repeatedly, right?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-04-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13048.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12166
[https://github.com/apache/spark/pull/12166]

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>Assignee: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}
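
A sketch of the simpler alternative mentioned above, i.e. keeping the most recent 
checkpoint and leaving its cleanup to the user (hypothetical helper, not the actual patch):
{code}
import scala.collection.mutable

// Mirrors PeriodicCheckpointer's queue of checkpoint paths, but keeps the newest one
// alive so a returned DistributedLDAModel can still read the files it depends on.
class CheckpointQueueSketch(deleteFiles: String => Unit) {
  private val checkpointQueue = mutable.Queue[String]()

  def add(checkpointPath: String): Unit = checkpointQueue.enqueue(checkpointPath)

  def deleteAllCheckpointsButLast(): Unit = {
    while (checkpointQueue.size > 1) {
      deleteFiles(checkpointQueue.dequeue())   // drop older checkpoints only
    }
  }
}
{code}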



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0

2016-04-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13448:
--
Description: 
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized if users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, because 
it calls the ML LogisticRegression implementation. Meanwhile, if users train without 
regularization, training with or without feature scaling will return the same 
solution at the same convergence rate (because they run the same code path); 
this behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.

  was:
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized if users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, because 
it calls the ML LogisticRegression implementation. Meanwhile, if users train without 
regularization, training with or without feature scaling will return the same 
solution at the same convergence rate (because they run the same code path); 
this behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results


> Document MLlib behavior changes in Spark 2.0
> 
>
> Key: SPARK-13448
> URL: https://issues.apache.org/jira/browse/SPARK-13448
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
> remember to add them to the migration guide / release notes.
> * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
> to 1e-6.
> * SPARK-7780: The intercept will not be regularized if users train a binary 
> classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
> because it calls the ML LogisticRegression implementation. Meanwhile, if users 
> train without regularization, training with or without feature scaling will return 
> the same solution at the same convergence rate (because they run the same code 
> path); this behavior is different from the old API.
> * SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
> results
> * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
> default, if checkpointing is being used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14478) Should StandardScaler use biased variance to scale?

2016-04-07 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14478:
-

 Summary: Should StandardScaler use biased variance to scale?
 Key: SPARK-14478
 URL: https://issues.apache.org/jira/browse/SPARK-14478
 Project: Spark
  Issue Type: Question
  Components: ML, MLlib
Reporter: Joseph K. Bradley


Currently, MLlib's StandardScaler scales columns using the unbiased standard 
deviation.  This matches what R's scale package does.

However, it is a bit odd for 2 reasons:
* Optimization/ML algorithms which require scaled columns generally assume unit 
variance (for mathematical convenience).  That requires using biased variance.
* scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance.

*Question*: Should we switch to biased variance?
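
For reference, a small framework-free Scala illustration of the two estimators 
(nothing Spark-specific; n is the column length):
{code}
// Unbiased sample variance divides by (n - 1); biased ("population") variance divides by n.
def variances(xs: Array[Double]): (Double, Double) = {
  val n = xs.length
  val mean = xs.sum / n
  val ss = xs.map(x => (x - mean) * (x - mean)).sum
  (ss / (n - 1), ss / n)   // (unbiased, biased)
}

val (unbiased, biased) = variances(Array(1.0, 2.0, 3.0, 4.0))
// Dividing a centered column by sqrt(biased) yields exactly unit variance in the 1/n
// sense, which is the mathematical-convenience argument above; sqrt(unbiased) does not.
println(s"unbiased = $unbiased, biased = $biased")
{code}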



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231532#comment-15231532
 ] 

Joseph K. Bradley commented on SPARK-14478:
---

I'm listing this as "Major" priority since it is a behavioral change and would 
be good to decide before 2.0.

> Should StandardScaler use biased variance to scale?
> ---
>
> Key: SPARK-14478
> URL: https://issues.apache.org/jira/browse/SPARK-14478
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> Currently, MLlib's StandardScaler scales columns using the unbiased standard 
> deviation.  This matches what R's scale package does.
> However, it is a bit odd for 2 reasons:
> * Optimization/ML algorithms which require scaled columns generally assume 
> unit variance (for mathematical convenience).  That requires using biased 
> variance.
> * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance.
> *Question*: Should we switch to biased variance?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-07 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231491#comment-15231491
 ] 

Shea Parkes commented on SPARK-13842:
-

Pull request is available (https://github.com/apache/spark/pull/12251).  I did 
go ahead and make the {{names}} and {{_needSerializeAnyField}} attributes lazy 
while I was at it.  I'll try to ping you guys appropriately on there.

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
> """Iterate the fields upon request."""
> return iter(self.fields)
> def __getitem__(self, key):
> """Return the corresponding StructField"""
> _fields_dict = dict(zip(self.names, self.fields))
> try:
> return _fields_dict[key]
> except KeyError:
> raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13842:


Assignee: Apache Spark

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
> """Iterate the fields upon request."""
> return iter(self.fields)
> def __getitem__(self, key):
> """Return the corresponding StructField"""
> _fields_dict = dict(zip(self.names, self.fields))
> try:
> return _fields_dict[key]
> except KeyError:
> raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14472) Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable

2016-04-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14472:
--
Assignee: Bryan Cutler

> Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from 
> JavaCallable
> --
>
> Key: SPARK-14472
> URL: https://issues.apache.org/jira/browse/SPARK-14472
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> Currently, JavaCallable is used to wrap a plain Java object and act as a 
> mixin to JavaModel to provide a convenient method to make Java calls to an 
> object defined in JavaWrapper.  The inheritance structure could be simplified 
> by defining the object in JavaCallable and using it as a base class for 
> JavaWrapper.  Also, some renaming of these classes might better reflect their 
> purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231490#comment-15231490
 ] 

Apache Spark commented on SPARK-13842:
--

User 'skparkes' has created a pull request for this issue:
https://github.com/apache/spark/pull/12251

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
> """Iterate the fields upon request."""
> return iter(self.fields)
> def __getitem__(self, key):
> """Return the corresponding StructField"""
> _fields_dict = dict(zip(self.names, self.fields))
> try:
> return _fields_dict[key]
> except KeyError:
> raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13842:


Assignee: (was: Apache Spark)

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
> """Iterate the fields upon request."""
> return iter(self.fields)
> def __getitem__(self, key):
> """Return the corresponding StructField"""
> _fields_dict = dict(zip(self.names, self.fields))
> try:
> return _fields_dict[key]
> except KeyError:
> raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2016-04-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231465#comment-15231465
 ] 

Reynold Xin commented on SPARK-10063:
-

I think Josh et al already replied -- but to close the loop, the direct 
committer is not safe when there is a network partition, e.g. Spark driver 
might not be aware of a task that's running on the executor.


> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call the 
> output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14452) Explicit APIs in Scala for specifying encoders

2016-04-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14452.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Explicit APIs in Scala for specifying encoders
> --
>
> Key: SPARK-14452
> URL: https://issues.apache.org/jira/browse/SPARK-14452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> The Scala Dataset public API currently only allows users to specify encoders 
> through SQLContext.implicits. This is OK but sometimes people want to 
> explicitly get encoders without a SQLContext (e.g. Aggregator 
> implementations).
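
A sketch of the kind of call site this enables, assuming the 
{{org.apache.spark.sql.Encoders}} factory object (the exact method names here are 
illustrative, not confirmed against the final API):
{code}
import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Int)

// Obtain encoders directly, with no SQLContext or imported implicits in scope,
// e.g. from inside an Aggregator implementation.
val personEncoder: Encoder[Person] = Encoders.product[Person]
val longEncoder: Encoder[Long] = Encoders.scalaLong
{code}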



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14449) SparkContext should use SparkListenerInterface

2016-04-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14449.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> SparkContext should use SparkListenerInterface
> --
>
> Key: SPARK-14449
> URL: https://issues.apache.org/jira/browse/SPARK-14449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231442#comment-15231442
 ] 

Kevin Hogeland edited comment on SPARK-14437 at 4/8/16 12:56 AM:
-

[~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is 
able to connect to the block manager. Thanks for the quick patch.

I also encountered this error when trying to run with this change on the latest 
2.0.0-SNAPSHOT, possibly unrelated but worth documenting here:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 
(TID 24, ip-172-16-15-0.us-west-2.compute.internal): 
java.lang.RuntimeException: Stream '/jars/' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: hogeland):
[~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is 
able to connect to the block manager. Thanks for the quick patch.

I also encountered this error when trying to run on the latest 2.0.0-SNAPSHOT, 
possibly unrelated but worth documenting here:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 
(TID 24, ip-172-16-15-0.us-west-2.compute.internal): 
java.lang.RuntimeException: Stream '/jars/' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 

[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231442#comment-15231442
 ] 

Kevin Hogeland commented on SPARK-14437:


[~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is 
able to connect to the block manager. Thanks for the quick patch.

I also encountered this error when trying to run on the latest 2.0.0-SNAPSHOT, 
possibly unrelated but worth documenting here:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 
(TID 24, ip-172-16-15-0.us-west-2.compute.internal): 
java.lang.RuntimeException: Stream '/jars/' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}

> Spark using Netty RPC gets wrong address in some setups
> ---
>
> Key: SPARK-14437
> URL: https://issues.apache.org/jira/browse/SPARK-14437
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: AWS, Docker, Flannel
>Reporter: Kevin Hogeland
>
> Netty can't get the correct origin address in certain network setups. Spark 
> should handle this, as relying on Netty correctly reporting all addresses 
> leads to incompatible and unpredictable network states. We're currently using 
> Docker with Flannel on AWS. Container communication looks something like: 
> {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
> -> Container 2 (4.5.6.1)}}
> If the client in that setup is Container 1 (1.2.3.1), Netty channels from 
> there to Container 2 will have a client address of 1.2.3.0.
> The {{RequestMessage}} object that is sent over the wire already contains a 
> {{senderAddress}} field that the sender can use to specify their address. In 
> {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client 
> socket address when null. {{senderAddress}} in the messages sent from the 
> executors is currently always null, meaning all messages will have these 
> incorrect addresses (we've switched back to Akka as a temporary workaround 
> for this). The executor should send its address explicitly so that the driver 
> doesn't attempt to infer addresses based on possibly incorrect information 
> from Netty.
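
A simplified sketch of the addressing rule being proposed (stand-in classes, not 
Spark's actual RPC types): prefer the explicitly supplied sender address, and fall 
back to the transport-level address only when none was sent.
{code}
case class RpcAddress(host: String, port: Int)
case class RequestMessage(senderAddress: Option[RpcAddress], payload: Array[Byte])

// Receiver side: only trust the Netty channel's remote address as a last resort,
// since NAT/overlay hops (Docker + Flannel here) can make it point at a host, not the executor.
def effectiveSender(msg: RequestMessage, channelRemoteAddress: RpcAddress): RpcAddress =
  msg.senderAddress.getOrElse(channelRemoteAddress)
{code}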



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Resolved] (SPARK-14468) Always enable OutputCommitCoordinator

2016-04-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14468.
---
  Resolution: Fixed
   Fix Version/s: 1.5.2
  2.0.0
  1.6.2
  1.4.2
Target Version/s: 1.5.2, 1.4.2, 1.6.2, 2.0.0  (was: 1.4.2, 1.5.2, 1.6.2, 
2.0.0)

> Always enable OutputCommitCoordinator
> -
>
> Key: SPARK-14468
> URL: https://issues.apache.org/jira/browse/SPARK-14468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.4.2, 1.6.2, 2.0.0, 1.5.2
>
>
> The OutputCommitCoordinator was originally introduced in SPARK-4879 because 
> speculation causes the output of some partitions to be deleted. However, as 
> we can see in SPARK-10063, speculation is not the only case where this can 
> happen.
> More specifically, when we retry a stage we're not guaranteed to kill the 
> tasks that are still running (we don't even interrupt their threads), so we 
> may end up with multiple concurrent task attempts for the same task. This 
> leads to problems like SPARK-8029, whose fix is necessary but not 
> sufficient on its own.
> In general, when we run into situations like these, we need the 
> OutputCommitCoordinator because we don't control what the underlying file 
> system does. Enabling this doesn't induce heavy performance costs so there's 
> little reason why we shouldn't always enable it to ensure correctness.
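
The guarantee the coordinator provides can be summarized with a small 
first-committer-wins sketch (an illustration, not Spark's implementation):
{code}
import scala.collection.mutable

// For each (stage, partition), only the first task attempt that asks is allowed to
// commit; every other concurrent or speculative attempt for that partition is refused.
class CommitCoordinatorSketch {
  private val winners = mutable.Map[(Int, Int), Int]()   // (stage, partition) -> attempt

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean = synchronized {
    winners.get((stage, partition)) match {
      case Some(winner) => winner == attempt
      case None =>
        winners((stage, partition)) = attempt
        true
    }
  }
}
{code}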



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231425#comment-15231425
 ] 

Apache Spark commented on SPARK-14477:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/12250

> Allow custom mirrors for downloading artifacts in build/mvn
> ---
>
> Key: SPARK-14477
> URL: https://issues.apache.org/jira/browse/SPARK-14477
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
> from. It makes sense to override these locations with mirrors in many cases, 
> so this change will add support for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14477:


Assignee: (was: Apache Spark)

> Allow custom mirrors for downloading artifacts in build/mvn
> ---
>
> Key: SPARK-14477
> URL: https://issues.apache.org/jira/browse/SPARK-14477
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
> from. It makes sense to override these locations with mirrors in many cases, 
> so this change will add support for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14477:


Assignee: Apache Spark

> Allow custom mirrors for downloading artifacts in build/mvn
> ---
>
> Key: SPARK-14477
> URL: https://issues.apache.org/jira/browse/SPARK-14477
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
> from. It makes sense to override these locations with mirrors in many cases, 
> so this change will add support for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-07 Thread Mark Grover (JIRA)
Mark Grover created SPARK-14477:
---

 Summary: Allow custom mirrors for downloading artifacts in 
build/mvn
 Key: SPARK-14477
 URL: https://issues.apache.org/jira/browse/SPARK-14477
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Mark Grover
Priority: Minor


Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
from. It makes sense to override these locations with mirrors in many cases, so 
this change will add support for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14270) whole stage codegen support for typed filter

2016-04-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14270.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12061
[https://github.com/apache/spark/pull/12061]

> whole stage codegen support for typed filter
> 
>
> Key: SPARK-14270
> URL: https://issues.apache.org/jira/browse/SPARK-14270
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14415) All functions should show usages by command `DESC FUNCTION`

2016-04-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14415:
--
Description: 
Currently, many functions do not show usages, as in the following example.
{code}
scala> sql("desc function extended `sin`").collect().foreach(println)
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: To be added.]
[Extended Usage:
To be added.]
{code}

This PR adds descriptions for functions and adds a test case to prevent adding 
a function without a usage description.
{code}
scala>  sql("desc function extended `sin`").collect().foreach(println);
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: sin(x) - Returns the sine of x.]
[Extended Usage:
> SELECT sin(0);
 0.0]
{code}

The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.

  was:
For Spark SQL, this issue aims to show the following function (expression) 
description properly by adding `ExpressionDescription` annotation.

*Functions*
abs
acos
asin
atan
atan2
ascii
base64
bin
ceil
ceiling
concat
concat_ws
conv
cos
cosh
decode
degrees
e
encode
exp
expm1
hex
hypot
factorial
find_in_set
floor
format_number
format_string
instr
length
levenshtein
locate
log
log2
log10
log1p
lpad
ltrim
pi
pmod
pow
power
radians
repeat
reverse
round
rpad
rtrim
shiftleft
shiftright
shiftrightunsigned
signum
sin
sinh
soundex
sqrt
substr
substring
substring_index
tan
tanh
translate
trim
unbase64
unhex

*Files*
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

*Before*
{code}
scala> sql("desc function extended `sin`").collect().foreach(println)
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: To be added.]
[Extended Usage:
To be added.]
{code}

*After*
{code}
scala>  sql("desc function extended `sin`").collect().foreach(println);
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: sin(x) - Returns the sine of x.]
[Extended Usage:
> SELECT sin(0);
 0.0]
{code}

Summary: All functions should show usages by command `DESC FUNCTION`  
(was: Add ExpressionDescription annotation for SQL expressions)

> All functions should show usages by command `DESC FUNCTION`
> ---
>
> Key: SPARK-14415
> URL: https://issues.apache.org/jira/browse/SPARK-14415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> Currently, many functions do not show usages, as in the following example.
> {code}
> scala> sql("desc function extended `sin`").collect().foreach(println)
> [Function: sin]
> [Class: org.apache.spark.sql.catalyst.expressions.Sin]
> [Usage: To be added.]
> [Extended Usage:
> To be added.]
> {code}
> This PR adds descriptions for functions and adds a test case to prevent adding 
> a function without a usage description.
> {code}
> scala>  sql("desc function extended `sin`").collect().foreach(println);
> [Function: sin]
> [Class: org.apache.spark.sql.catalyst.expressions.Sin]
> [Usage: sin(x) - Returns the sine of x.]
> [Extended Usage:
> > SELECT sin(0);
>  0.0]
> {code}
> The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.
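
For context, usage strings like the one above are attached to expressions through 
the catalyst {{ExpressionDescription}} annotation; a rough sketch follows (the 
stand-in class is hypothetical, and the field names are assumed, so treat it as 
illustrative only):
{code}
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

// The usage/extended strings are what `DESC FUNCTION [EXTENDED]` prints for the function.
@ExpressionDescription(
  usage = "sin(x) - Returns the sine of x.",
  extended = "> SELECT sin(0);\n 0.0")
case class SinUsageSketch(x: Double)   // stand-in; real expressions extend the math expression classes
{code}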



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231097#comment-15231097
 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:01 AM:


Note on StandardScaler: MLlib's StandardScaler uses the unbiased sample std to 
rescale, whereas sklearn uses the biased sample std.
* [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses biased sample std.  R's [scale package | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know columns have L2 norm 1. 
* [~mengxr] reports that glmnet uses the biased sample std.
* *Q*: Should we change StandardScaler to use unbiased sample std?


was (Author: josephkb):
StandardScaler: This may be 2 confounded issues.  MLlib's StandardScaler uses 
the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* *Q*: [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses biased sample std.  R's [scale package | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know columns have L2 norm 1.  My 
main question is: What does glmnet do?  This is important since we compare with 
it for MLlib GLM unit tests.  The difference might be insignificant, though, 
for GLMs and the datasets we are testing on.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.
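
A framework-free sketch of why fold is the safer choice when combOp mutates and 
returns its first argument (the general idea, not the treeAggregate patch itself): 
fold hands the function a caller-supplied zero value to mutate, while reduce passes 
actual data elements, which the contract does not allow the function to modify.
{code}
import scala.collection.mutable.ArrayBuffer

// Per-partition accumulators, standing in for partition-level aggregation results.
val partials = Seq(ArrayBuffer(1, 2), ArrayBuffer(3), ArrayBuffer(4, 5))

// A mutating combOp: merges the right buffer into the left one and returns the left.
def combOp(a: ArrayBuffer[Int], b: ArrayBuffer[Int]): ArrayBuffer[Int] = { a ++= b; a }

// Safe with fold: mutation only ever touches the zero value we supplied.
val merged = partials.fold(ArrayBuffer.empty[Int])(combOp)
println(merged)   // ArrayBuffer(1, 2, 3, 4, 5)

// With reduce, combOp's first argument would be partials.head itself, so the
// input data would be mutated in place -- exactly the hazard described above.
{code}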



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14408:
--
Comment: was deleted

(was: Hm, maybe this is just a bug in this PR, looking at IDF.  De-escalating 
for now...)

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230994#comment-15230994
 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:00 AM:


After a bit of a scare (b/c of the confounding issue of StandardScaler not 
matching sklearn), it's definitely an issue with my initial PR to "fix" 
treeAggregate's implementation.  That said, I'm still having a hard time 
figuring out the right way to fix the implementation.  I'll comment more on the 
PR.


was (Author: josephkb):
Not meaning to cause panic here, but I'm escalating this since it might be a 
critical bug in MLlib.  [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you 
please help me confirm that this is a bug?  If you agree, then we can:
* Change this to a blocker for 2.0
* Update all failing unit tests.
** I propose to do this in a single PR.  It would be great to get help with 
fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary 
{{private[spark] def brokenTreeAggregate}} method to be used for unit tests not 
yet ported to the fixed treeAggregate.  But I'd prefer not to do this since we 
will want to backport the fix.
* Backport to all reasonable versions.  This will be painful because of unit 
tests.

Currently, I'm testing StandardScaler and IDF.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14408:
--
Priority: Minor  (was: Major)

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14476) Show table name or path in string of DataSourceScan

2016-04-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231354#comment-15231354
 ] 

Davies Liu commented on SPARK-14476:


cc [~lian cheng]

> Show table name or path in string of DataSourceScan
> ---
>
> Key: SPARK-14476
> URL: https://issues.apache.org/jira/browse/SPARK-14476
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the string of DataSourceScan is only "HadoopFiles xxx", without 
> any information about the table name or path. 
> Since we had that information in 1.6, this is a kind of regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14476) Show table name or path in string of DataSourceScan

2016-04-07 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14476:
--

 Summary: Show table name or path in string of DataSourceScan
 Key: SPARK-14476
 URL: https://issues.apache.org/jira/browse/SPARK-14476
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu


Right now, the string of DataSourceScan is only "HadoopFiles xxx", without any 
information about the table name or path. 

Since we had that information in 1.6, this is a kind of regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14475) Propagate user-defined context from driver to executors

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14475:


Assignee: (was: Apache Spark)

> Propagate user-defined context from driver to executors
> ---
>
> Key: SPARK-14475
> URL: https://issues.apache.org/jira/browse/SPARK-14475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>
> It would be useful (e.g. for tracing) to automatically propagate arbitrary 
> user defined context (i.e. thread-locals) from the driver to executors. We 
> can do this easily by adding sc.localProperties to TaskContext.
> cc [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14475) Propagate user-defined context from driver to executors

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231353#comment-15231353
 ] 

Apache Spark commented on SPARK-14475:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/12248

> Propagate user-defined context from driver to executors
> ---
>
> Key: SPARK-14475
> URL: https://issues.apache.org/jira/browse/SPARK-14475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>
> It would be useful (e.g. for tracing) to automatically propagate arbitrary 
> user defined context (i.e. thread-locals) from the driver to executors. We 
> can do this easily by adding sc.localProperties to TaskContext.
> cc [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14475) Propagate user-defined context from driver to executors

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14475:


Assignee: Apache Spark

> Propagate user-defined context from driver to executors
> ---
>
> Key: SPARK-14475
> URL: https://issues.apache.org/jira/browse/SPARK-14475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> It would be useful (e.g. for tracing) to automatically propagate arbitrary 
> user defined context (i.e. thread-locals) from the driver to executors. We 
> can do this easily by adding sc.localProperties to TaskContext.
> cc [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14388) Create Table

2016-04-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-14388:
-

Assignee: Andrew Or

> Create Table
> 
>
> Key: SPARK-14388
> URL: https://issues.apache.org/jira/browse/SPARK-14388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
>
> For now, we still ask Hive to handle creating Hive tables. We should handle 
> them ourselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14475) Propagate user-defined context from driver to executors

2016-04-07 Thread Eric Liang (JIRA)
Eric Liang created SPARK-14475:
--

 Summary: Propagate user-defined context from driver to executors
 Key: SPARK-14475
 URL: https://issues.apache.org/jira/browse/SPARK-14475
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Eric Liang


It would be useful (e.g. for tracing) to automatically propagate arbitrary user 
defined context (i.e. thread-locals) from the driver to executors. We can do 
this easily by adding sc.localProperties to TaskContext.

cc [~joshrosen]
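A minimal sketch of the intended flow. The driver-side sc.setLocalProperty already 
exists; the task-side accessor (written here as TaskContext.getLocalProperty) is the 
part proposed in this ticket, so treat that name as an assumption until the PR settles.

{code}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object LocalPropertySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("local-props"))

    // Driver: attach arbitrary user context (e.g. a trace id) to jobs submitted by this thread.
    sc.setLocalProperty("trace.id", "req-42")

    // Executors: tasks read the propagated property from their TaskContext
    // (proposed accessor; name assumed for illustration).
    val traceIds = sc.parallelize(1 to 4, 2).map { _ =>
      TaskContext.get().getLocalProperty("trace.id")
    }.collect()

    println(traceIds.mkString(","))  // expected: req-42 echoed once per record
    sc.stop()
  }
}
{code}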



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14410) SessionCatalog needs to check function existence

2016-04-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14410:
--
Comment: was deleted

(was: User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/12183)

> SessionCatalog needs to check function existence 
> -
>
> Key: SPARK-14410
> URL: https://issues.apache.org/jira/browse/SPARK-14410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Right now, operations on an existing function in SessionCatalog do not 
> really check whether the function exists. We should add this check and avoid 
> doing the check in each command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14474) Move FileSource offset log into checkpointLocation

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14474:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Move FileSource offset log into checkpointLocation
> --
>
> Key: SPARK-14474
> URL: https://issues.apache.org/jira/browse/SPARK-14474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Now that we have a single location for storing checkpointed state, propagate 
> this information into the source so that we don't have one random log off on 
> its own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14474) Move FileSource offset log into checkpointLocation

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231334#comment-15231334
 ] 

Apache Spark commented on SPARK-14474:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/12247

> Move FileSource offset log into checkpointLocation
> --
>
> Key: SPARK-14474
> URL: https://issues.apache.org/jira/browse/SPARK-14474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Now that we have a single location for storing checkpointed state, propagate 
> this information into the source so that we don't have one random log off on 
> its own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14474) Move FileSource offset log into checkpointLocation

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14474:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Move FileSource offset log into checkpointLocation
> --
>
> Key: SPARK-14474
> URL: https://issues.apache.org/jira/browse/SPARK-14474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Now that we have a single location for storing checkpointed state, propagate 
> this information into the source so that we don't have one random log off on 
> its own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14474) Move FileSource offset log into checkpointLocation

2016-04-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-14474:


 Summary: Move FileSource offset log into checkpointLocation
 Key: SPARK-14474
 URL: https://issues.apache.org/jira/browse/SPARK-14474
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Now that we have a single location for storing checkpointed state, propagate 
this information into the source so that we don't have one random log off on 
its own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14473) Define analysis rules for operations not supported in streaming

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231330#comment-15231330
 ] 

Apache Spark commented on SPARK-14473:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/12246

> Define analysis rules for operations not supported in streaming
> ---
>
> Key: SPARK-14473
> URL: https://issues.apache.org/jira/browse/SPARK-14473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> There are many operations that are currently not supported in the streaming 
> execution. Some examples:
>  - joining two streams
>  - unioning a stream and a batch source
>  - sorting
>  - window functions (not time windows)
>  - distinct aggregates
> Furthermore, executing a query with a stream source as a batch query should 
> also fail.
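To make the scope concrete, a sketch of the query shapes these rules should reject, 
written against hypothetical streaming DataFrames streamA/streamB and a batch 
DataFrame batchDF (no API for creating them is assumed here, since that surface was 
still settling at this point):

{code}
// Hypothetical DataFrames: streamA and streamB are streaming, batchDF is batch.
// streamA.join(streamB, "key")            // joining two streams
// streamA.union(batchDF)                  // unioning a stream and a batch source
// streamA.sort("key")                     // sorting
// streamA.select(rank().over(someWindow)) // window functions (not time windows)
// streamA.agg(countDistinct("key"))       // distinct aggregates
// streamA.collect()                       // running a stream source as a batch query
{code}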



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14473) Define analysis rules for operations not supported in streaming

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14473:


Assignee: Apache Spark  (was: Tathagata Das)

> Define analysis rules for operations not supported in streaming
> ---
>
> Key: SPARK-14473
> URL: https://issues.apache.org/jira/browse/SPARK-14473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> There are many operations that are currently not supported in the streaming 
> execution. Some examples:
>  - joining two streams
>  - unioning a stream and a batch source
>  - sorting
>  - window functions (not time windows)
>  - distinct aggregates
> Furthermore, executing a query with a stream source as a batch query should 
> also fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14473) Define analysis rules for operations not supported in streaming

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14473:


Assignee: Tathagata Das  (was: Apache Spark)

> Define analysis rules for operations not supported in streaming
> ---
>
> Key: SPARK-14473
> URL: https://issues.apache.org/jira/browse/SPARK-14473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> There are many operations that are currently not supported in the streaming 
> execution. Some examples:
>  - joining two streams
>  - unioning a stream and a batch source
>  - sorting
>  - window functions (not time windows)
>  - distinct aggregates
> Furthermore, executing a query with a stream source as a batch query should 
> also fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14473) Define analysis rules for operations not supported in streaming

2016-04-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-14473:
--
Description: 
There are many operations that are currently not supported in the streaming 
execution. For example:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should 
also fail.


  was:
There are many operations that are currently not supported in the streaming 
execution. For example:

Some examples:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should 
also fail.



> Define analysis rules for operations not supported in streaming
> ---
>
> Key: SPARK-14473
> URL: https://issues.apache.org/jira/browse/SPARK-14473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> There are many operations that are currently not supported in the streaming 
> execution. For example:
>  - joining two streams
>  - unioning a stream and a batch source
>  - sorting
>  - window functions (not time windows)
>  - distinct aggregates
> Furthermore, executing a query with a stream source as a batch query should 
> also fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14410) SessionCatalog needs to check function existence

2016-04-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14410.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12198
[https://github.com/apache/spark/pull/12198]

> SessionCatalog needs to check function existence 
> -
>
> Key: SPARK-14410
> URL: https://issues.apache.org/jira/browse/SPARK-14410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Right now, operations on an existing function in SessionCatalog do not 
> really check whether the function exists. We should add this check and avoid 
> doing the check in each command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14414) Make error messages consistent across DDLs

2016-04-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231321#comment-15231321
 ] 

Yin Huai commented on SPARK-14414:
--

Let's also take care of 
https://github.com/apache/spark/pull/12198#discussion_r58955840 with this PR.

> Make error messages consistent across DDLs
> --
>
> Key: SPARK-14414
> URL: https://issues.apache.org/jira/browse/SPARK-14414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There are many different error messages right now when the user tries to run 
> something that's not supported. We might throw AnalysisException or 
> ParseException or NoSuchFunctionException etc. We should make all of these 
> consistent before 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14473) Define analysis rules for operations not supported in streaming

2016-04-07 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-14473:
-

 Summary: Define analysis rules for operations not supported in 
streaming
 Key: SPARK-14473
 URL: https://issues.apache.org/jira/browse/SPARK-14473
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


There are many operations that are currently not supported in the streaming 
execution. Some examples:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should 
also fail.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14472) Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable

2016-04-07 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231320#comment-15231320
 ] 

Bryan Cutler commented on SPARK-14472:
--

I'm working on it :D

> Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from 
> JavaCallable
> --
>
> Key: SPARK-14472
> URL: https://issues.apache.org/jira/browse/SPARK-14472
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Currently, JavaCallable is used to wrap a plain Java object and act as a 
> mixin to JavaModel to provide a convenient method to make Java calls to an 
> object defined in JavaWrapper.  The inheritance structure could be simplified 
> by defining the object in JavaCallable and using it as a base class for 
> JavaWrapper.  Also, some renaming of these classes might better reflect their 
> purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14472) Cleanup PySpark-ML Java wrapper classes so that JavaWrapper will inherit from JavaCallable

2016-04-07 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-14472:


 Summary: Cleanup PySpark-ML Java wrapper classes so that 
JavaWrapper will inherit from JavaCallable
 Key: SPARK-14472
 URL: https://issues.apache.org/jira/browse/SPARK-14472
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Bryan Cutler
Priority: Minor


Currently, JavaCallable is used to wrap a plain Java object and act as a mixin 
to JavaModel to provide a convenient method to make Java calls to an object 
defined in JavaWrapper.  The inheritance structure could be simplified by 
defining the object in JavaCallable and use as a base class for JavaWrapper.  
Also, some renaming of these classes might better reflect their purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-07 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231308#comment-15231308
 ] 

DB Tsai commented on SPARK-13944:
-

For production use cases, it's not desirable to include the whole Spark stack just 
to use the linear algebra library or even the models in Spark mllib, and much of 
the time those implementations can stand alone without depending on the Spark 
platform. Because mllib currently depends on the Spark platform, using it in 
production often causes jar conflicts, and people end up reimplementing the 
functionality for production. 

The goal of this PR is only to separate the local linear algebra out from 
mllib and set up a build so that we can provide an mllib-local jar. The long-term 
goal is to gradually move the platform-independent code out of mllib 
to mllib-local, so people can easily use it in their production apps. 

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in the ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> a vector is loaded from an mllib vector by Spark SQL, it will automatically 
> be converted into the one in the ml package.
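A minimal sketch of what the package move means for user code, assuming the rename 
lands exactly as described (mllib.linalg to ml.linalg, type names unchanged):

{code}
// Before: tied to the full Spark dependency.
//   import org.apache.spark.mllib.linalg.{Vector, Vectors}
// After: served from the standalone mllib-local module.
import org.apache.spark.ml.linalg.{Vector, Vectors}

object LinalgPackageSketch {
  def main(args: Array[String]): Unit = {
    val v: Vector = Vectors.dense(1.0, 2.0, 3.0)  // same factory API, new package
    println(v.size)
  }
}
{code}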



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14471) The alias created in SELECT could be used in GROUP BY

2016-04-07 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14471:
--

 Summary: The alias created in SELECT could be used in GROUP BY
 Key: SPARK-14471
 URL: https://issues.apache.org/jira/browse/SPARK-14471
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu


This query should be able to run:

{code}

select a a1, count(1) from t group by a1

{code}
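For illustration, a self-contained sketch (API names as of 1.6/2.0-preview, e.g. 
registerTempTable) showing the grouping that already resolves today and, commented 
out, the aliased form this ticket asks to support:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object GroupByAliasSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("groupby-alias"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    Seq((1, "x"), (1, "y"), (2, "z")).toDF("a", "b").registerTempTable("t")

    // Works today: group by the underlying column.
    sqlContext.sql("select a a1, count(1) from t group by a").show()

    // What this ticket asks for: reuse the SELECT alias in GROUP BY.
    // sqlContext.sql("select a a1, count(1) from t group by a1").show()

    sc.stop()
  }
}
{code}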



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14462) Add the mllib-local build to maven pom

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14462:


Assignee: DB Tsai  (was: Apache Spark)

> Add the mllib-local build to maven pom
> --
>
> Key: SPARK-14462
> URL: https://issues.apache.org/jira/browse/SPARK-14462
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Blocker
>
> In order to separate the linear algebra and vector/matrix classes into a 
> standalone jar, we need to set up the build first. This task will create a new 
> jar called mllib-local with minimal dependencies. The test scope will still 
> depend on spark-core and spark-core-test in order to use the common 
> utilities, but the runtime will avoid any platform dependency. A couple of 
> platform-independent classes will be moved to this package to demonstrate how 
> this works. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14462) Add the mllib-local build to maven pom

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231299#comment-15231299
 ] 

Apache Spark commented on SPARK-14462:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12241

> Add the mllib-local build to maven pom
> --
>
> Key: SPARK-14462
> URL: https://issues.apache.org/jira/browse/SPARK-14462
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Blocker
>
> In order to separate the linear algebra and vector/matrix classes into a 
> standalone jar, we need to set up the build first. This task will create a new 
> jar called mllib-local with minimal dependencies. The test scope will still 
> depend on spark-core and spark-core-test in order to use the common 
> utilities, but the runtime will avoid any platform dependency. A couple of 
> platform-independent classes will be moved to this package to demonstrate how 
> this works. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14462) Add the mllib-local build to maven pom

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14462:


Assignee: Apache Spark  (was: DB Tsai)

> Add the mllib-local build to maven pom
> --
>
> Key: SPARK-14462
> URL: https://issues.apache.org/jira/browse/SPARK-14462
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Blocker
>
> In order to separate the linear algebra and vector/matrix classes into a 
> standalone jar, we need to set up the build first. This task will create a new 
> jar called mllib-local with minimal dependencies. The test scope will still 
> depend on spark-core and spark-core-test in order to use the common 
> utilities, but the runtime will avoid any platform dependency. A couple of 
> platform-independent classes will be moved to this package to demonstrate how 
> this works. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14469) Remove mllib-local from mima project exclusion

2016-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231283#comment-15231283
 ] 

Sean Owen commented on SPARK-14469:
---

This seems like a duplicate or sub-task of your other JIRAs for mllib-local. 
Let's combine them? I don't see how this is a stand-alone task.

> Remove mllib-local from mima project exclusion
> --
>
> Key: SPARK-14469
> URL: https://issues.apache.org/jira/browse/SPARK-14469
> Project: Spark
>  Issue Type: Task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> We need to remove the exclude once 2.0 has been published and there is a 
> previous artifact for MiMa to compare against.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231276#comment-15231276
 ] 

Sean Owen commented on SPARK-13944:
---

I'm still not clear on the purpose of this change. I don't think Spark has a 
goal of providing local, non-distributed, non-Spark-based ML implementations. I 
can imagine providing a module of API classes only, but that also does not 
seem to be the purpose here. What is in this "local" module and why?

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in the ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> a vector is loaded from an mllib vector by Spark SQL, it will automatically 
> be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14127) [Table related commands] Describe table

2016-04-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231268#comment-15231268
 ] 

Xiao Li commented on SPARK-14127:
-

We will enable it in SQLContext. Move it to sql/core.

> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-07 Thread Aaron Tokhy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231255#comment-15231255
 ] 

Aaron Tokhy commented on SPARK-14470:
-

Patch cleanly applies to both branch-1.6 and master

> Allow for overriding both httpclient and httpcore versions
> --
>
> Key: SPARK-14470
> URL: https://issues.apache.org/jira/browse/SPARK-14470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Aaron Tokhy
>Priority: Minor
>
> The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
> 'httpcore' versions are the same.  This restriction isn't necessarily true, 
> as you could potentially have an httpclient version of 4.3.6 and an httpcore 
> version of 4.3.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14470:


Assignee: (was: Apache Spark)

> Allow for overriding both httpclient and httpcore versions
> --
>
> Key: SPARK-14470
> URL: https://issues.apache.org/jira/browse/SPARK-14470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Aaron Tokhy
>Priority: Minor
>
> The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
> 'httpcore' versions are the same.  This restriction isn't necessarily true, 
> as you could potentially have an httpclient version of 4.3.6 and an httpcore 
> version of 4.3.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14470:


Assignee: Apache Spark

> Allow for overriding both httpclient and httpcore versions
> --
>
> Key: SPARK-14470
> URL: https://issues.apache.org/jira/browse/SPARK-14470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Aaron Tokhy
>Assignee: Apache Spark
>Priority: Minor
>
> The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
> 'httpcore' versions are the same.  This restriction isn't necessarily true, 
> as you could potentially have an httpclient version of 4.3.6 and an httpcore 
> version of 4.3.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231252#comment-15231252
 ] 

Apache Spark commented on SPARK-14470:
--

User 'atokhy' has created a pull request for this issue:
https://github.com/apache/spark/pull/12245

> Allow for overriding both httpclient and httpcore versions
> --
>
> Key: SPARK-14470
> URL: https://issues.apache.org/jira/browse/SPARK-14470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Aaron Tokhy
>Priority: Minor
>
> The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
> 'httpcore' versions are the same.  This restriction isn't necessarily true, 
> as you could potentially have an httpclient version of 4.3.6 and an httpcore 
> version of 4.3.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14468) Always enable OutputCommitCoordinator

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231235#comment-15231235
 ] 

Apache Spark commented on SPARK-14468:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/12244

> Always enable OutputCommitCoordinator
> -
>
> Key: SPARK-14468
> URL: https://issues.apache.org/jira/browse/SPARK-14468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The OutputCommitCoordinator was originally introduced in SPARK-4879 because 
> speculation causes the output of some partitions to be deleted. However, as 
> we can see in SPARK-10063, speculation is not the only case where this can 
> happen.
> More specifically, when we retry a stage we're not guaranteed to kill the 
> tasks that are still running (we don't even interrupt their threads), so we 
> may end up with multiple concurrent task attempts for the same task. This 
> leads to problems like SPARK-8029; the fix for that issue is necessary but not 
> sufficient on its own.
> In general, when we run into situations like these, we need the 
> OutputCommitCoordinator because we don't control what the underlying file 
> system does. Enabling this doesn't induce heavy performance costs so there's 
> little reason why we shouldn't always enable it to ensure correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14468) Always enable OutputCommitCoordinator

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14468:


Assignee: Andrew Or  (was: Apache Spark)

> Always enable OutputCommitCoordinator
> -
>
> Key: SPARK-14468
> URL: https://issues.apache.org/jira/browse/SPARK-14468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The OutputCommitCoordinator was originally introduced in SPARK-4879 because 
> speculation causes the output of some partitions to be deleted. However, as 
> we can see in SPARK-10063, speculation is not the only case where this can 
> happen.
> More specifically, when we retry a stage we're not guaranteed to kill the 
> tasks that are still running (we don't even interrupt their threads), so we 
> may end up with multiple concurrent task attempts for the same task. This 
> leads to problems like SPARK-8029; the fix for that issue is necessary but not 
> sufficient on its own.
> In general, when we run into situations like these, we need the 
> OutputCommitCoordinator because we don't control what the underlying file 
> system does. Enabling this doesn't induce heavy performance costs so there's 
> little reason why we shouldn't always enable it to ensure correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14468) Always enable OutputCommitCoordinator

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14468:


Assignee: Apache Spark  (was: Andrew Or)

> Always enable OutputCommitCoordinator
> -
>
> Key: SPARK-14468
> URL: https://issues.apache.org/jira/browse/SPARK-14468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> The OutputCommitCoordinator was originally introduced in SPARK-4879 because 
> speculation causes the output of some partitions to be deleted. However, as 
> we can see in SPARK-10063, speculation is not the only case where this can 
> happen.
> More specifically, when we retry a stage we're not guaranteed to kill the 
> tasks that are still running (we don't even interrupt their threads), so we 
> may end up with multiple concurrent task attempts for the same task. This 
> leads to problems like SPARK-8029; the fix for that issue is necessary but not 
> sufficient on its own.
> In general, when we run into situations like these, we need the 
> OutputCommitCoordinator because we don't control what the underlying file 
> system does. Enabling this doesn't induce heavy performance costs so there's 
> little reason why we shouldn't always enable it to ensure correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14470) Allow for overriding both httpclient and httpcore versions

2016-04-07 Thread Aaron Tokhy (JIRA)
Aaron Tokhy created SPARK-14470:
---

 Summary: Allow for overriding both httpclient and httpcore versions
 Key: SPARK-14470
 URL: https://issues.apache.org/jira/browse/SPARK-14470
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.1, 2.0.0
Reporter: Aaron Tokhy
Priority: Minor


The Spark parent pom.xml assumes that the httpcomponents 'httpclient' and 
'httpcore' versions are the same.  This restriction isn't necessarily true, as 
you could potentially have an httpclient version of 4.3.6 and an httpcore 
version of 4.3.3.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231210#comment-15231210
 ] 

holdenk commented on SPARK-13842:
-

So the testing framework is a mixture of doctests along with the standard 
unittest2 stuff (in the tests.py file in each subdirectory). Let me know if 
you have any questions while you're doing this that I can help with :)

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
>     """Iterate the fields upon request."""
>     return iter(self.fields)
>
> def __getitem__(self, key):
>     """Return the corresponding StructField"""
>     _fields_dict = dict(zip(self.names, self.fields))
>     try:
>         return _fields_dict[key]
>     except KeyError:
>         raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14469) Remove mllib-local from mima project exclusion

2016-04-07 Thread DB Tsai (JIRA)
DB Tsai created SPARK-14469:
---

 Summary: Remove mllib-local from mima project exclusion
 Key: SPARK-14469
 URL: https://issues.apache.org/jira/browse/SPARK-14469
 Project: Spark
  Issue Type: Task
  Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: DB Tsai


We need to remove the exclude once 2.0 has been published and there is a 
previous artifact for MiMa to compare against.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14469) Remove mllib-local from mima project exclusion

2016-04-07 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-14469:
---

Assignee: DB Tsai

> Remove mllib-local from mima project exclusion
> --
>
> Key: SPARK-14469
> URL: https://issues.apache.org/jira/browse/SPARK-14469
> Project: Spark
>  Issue Type: Task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> We need to remove the exclude once 2.0 has been published and there is a 
> previous artifact for MiMa to compare against.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-07 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231201#comment-15231201
 ] 

Shea Parkes commented on SPARK-13842:
-

I'm willing to give it a first pass.  Need to go dig up what Python testing 
framework you guys are using, but that shouldn't be too hard.

Unless anyone objects, I'd like to move StructType.names and 
StructType._needSerializeAnyField to properties at the same time.  Should be a 
seamless refactor and cut down on the likelihood of future errors.

Might even get to it tonight.

Thanks!

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
>     """Iterate the fields upon request."""
>     return iter(self.fields)
>
> def __getitem__(self, key):
>     """Return the corresponding StructField"""
>     _fields_dict = dict(zip(self.names, self.fields))
>     try:
>         return _fields_dict[key]
>     except KeyError:
>         raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.
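For comparison, a short sketch of the roughly equivalent behaviour already available 
on the Scala side, where StructType is a Seq[StructField] and supports lookup by 
field name, so the Python proposal above would bring the two APIs closer together:

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StructTypeLookupSketch {
  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    schema.foreach(f => println(f.name))  // iteration: StructType is a Seq[StructField]
    println(schema("age"))                // lookup by name; throws if the field is missing
  }
}
{code}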



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14468) Always enable OutputCommitCoordinator

2016-04-07 Thread Andrew Or (JIRA)
Andrew Or created SPARK-14468:
-

 Summary: Always enable OutputCommitCoordinator
 Key: SPARK-14468
 URL: https://issues.apache.org/jira/browse/SPARK-14468
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or


The OutputCommitCoordinator was originally introduced in SPARK-4879 because 
speculation causes the output of some partitions to be deleted. However, as we 
can see in SPARK-10063, speculation is not the only case where this can happen.

More specifically, when we retry a stage we're not guaranteed to kill the tasks 
that are still running (we don't even interrupt their threads), so we may end 
up with multiple concurrent task attempts for the same task. This leads to 
problems like SPARK-8029; the fix for that issue is necessary but not sufficient on its own.

In general, when we run into situations like these, we need the 
OutputCommitCoordinator because we don't control what the underlying file 
system does. Enabling this doesn't induce heavy performance costs so there's 
little reason why we shouldn't always enable it to ensure correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14467) Add async io in FileScanRDD

2016-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14467:


Assignee: (was: Apache Spark)

> Add async io in FileScanRDD
> ---
>
> Key: SPARK-14467
> URL: https://issues.apache.org/jira/browse/SPARK-14467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>
> Experiments running over Parquet data in S3 show poor interleaving of CPU 
> and IO. We should do more async IO in FileScanRDD to make better use of the 
> machine resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14467) Add async io in FileScanRDD

2016-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231155#comment-15231155
 ] 

Apache Spark commented on SPARK-14467:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/12243

> Add async io in FileScanRDD
> ---
>
> Key: SPARK-14467
> URL: https://issues.apache.org/jira/browse/SPARK-14467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>
> Experiments running over Parquet data in S3 show poor interleaving of CPU 
> and IO. We should do more async IO in FileScanRDD to make better use of the 
> machine resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


