[jira] [Commented] (SPARK-20718) FileSourceScanExec with different filter orders should be the same after canonicalization

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007721#comment-16007721
 ] 

Apache Spark commented on SPARK-20718:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17962

> FileSourceScanExec with different filter orders should be the same after 
> canonicalization
> -
>
> Key: SPARK-20718
> URL: https://issues.apache.org/jira/browse/SPARK-20718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
> Usually this is ok because of canonicalization. However, in 
> `FileSourceScanExec`, its data filters and partition filters are sequences, 
> and their orders are not canonicalized. So `def sameResult` returns different 
> results for different orders of data/partition filters. This leads to, e.g., 
> different decisions for `ReuseExchange`, and thus results in unstable 
> performance.
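
The general idea behind a fix can be sketched as follows; this is a hedged illustration only (the helper name is made up for this note), not the actual patch in the PR above: canonicalization just needs to put the filter sequences into some deterministic order before comparison.

{code}
// Sketch only: sort filter expressions into a stable order so that two plans
// that differ only in filter ordering canonicalize to the same thing.
// `Expression` is Spark's catalyst expression base class; `sortFilters` is a
// hypothetical helper used here for illustration.
import org.apache.spark.sql.catalyst.expressions.Expression

def sortFilters(filters: Seq[Expression]): Seq[Expression] =
  filters.sortBy(_.hashCode())
{code}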






[jira] [Resolved] (SPARK-20619) StringIndexer supports multiple ways of label ordering

2017-05-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20619.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> StringIndexer supports multiple ways of label ordering
> --
>
> Key: SPARK-20619
> URL: https://issues.apache.org/jira/browse/SPARK-20619
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.3.0
>
>
> StringIndexer maps labels to numbers according to the descending order of 
> label frequency. Other types of ordering (e.g., alphabetical) may be needed 
> in feature ETL. For example, the ordering will affect the result in one-hot 
> encoding and RFormula. We propose to support other ordering methods by adding a 
> parameter stringOrderType that supports the following four options:
>- 'freq_desc': descending order by label frequency (most frequent label 
> assigned 0)
>- 'freq_asc': ascending order by label frequency (least frequent label 
> assigned 0)
>- 'alphabet_desc': descending alphabetical order
>- 'alphabet_asc': ascending alphabetical order
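
As a small plain-Scala illustration of what the four proposed options would mean for a toy label column (the option names are the ones proposed above; the eventual parameter values in Spark may differ):

{code}
// "b" appears 3 times, "a" twice, "c" once
val labels = Seq("b", "a", "b", "c", "b", "a")
val counts = labels.groupBy(identity).mapValues(_.size).toSeq  // Seq((b,3), (a,2), (c,1)) in some order

val freqDesc     = counts.sortBy { case (l, c) => (-c, l) }.map(_._1)  // b, a, c -> index 0 = "b"
val freqAsc      = counts.sortBy { case (l, c) => (c, l) }.map(_._1)   // c, a, b -> index 0 = "c"
val alphabetAsc  = counts.map(_._1).sorted                             // a, b, c -> index 0 = "a"
val alphabetDesc = counts.map(_._1).sorted.reverse                     // c, b, a -> index 0 = "c"
{code}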






[jira] [Commented] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)

2017-05-12 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007731#comment-16007731
 ] 

Jiang Xingbo commented on SPARK-20700:
--

I couldn't reproduce the failure on the current master branch; the test case I 
used is the following:
{code}
test("SPARK-20700: InferFiltersFromConstraints stackoverflows for query") {
withTempView("table_5") {
  withView("bools") {
sql(
  """CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, 
float_col_3, int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
|  ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, 
'571', TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
|  ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, 
'-278', TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
|  ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
|  ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), 
CAST(NULL AS INT), '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
|  ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, 
CAST(NULL AS STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
|  ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, 
'330', CAST(NULL AS TIMESTAMP), '-740'),
|  ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, 
'-766', CAST(NULL AS TIMESTAMP), CAST(NULL AS STRING)),
|  ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, 
'-514', CAST(NULL AS TIMESTAMP), '181'),
|  ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
|  ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, 
CAST(NULL AS STRING), CAST(NULL AS TIMESTAMP), '-62')
  """.
stripMargin)
sql("CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null)")

sql(
  """
  SELECT
|AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 
PRECEDING ) AS float_col,
|COUNT(t1.smallint_col_2) AS int_col
|FROM table_5 t1
|INNER JOIN (
|SELECT
|(MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
(t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
AS boolean_col,
|t2.a,
|(t1.int_col_4) * (t1.int_col_4) AS int_col
|FROM table_5 t1
|LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
|WHERE
|(t1.smallint_col_2) > (t1.smallint_col_2)
|GROUP BY
|t2.a,
|(t1.int_col_4) * (t1.int_col_4)
|HAVING
|((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * 
(t1.int_col_4), SUM(t1.int_col_4))
|) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = 
(t1.int_col_4))) AND ((t2.a) = (t1.smallint_col_2))
""".stripMargin)
  }
}
  }
{code}

> InferFiltersFromConstraints stackoverflows for query (v2)
> -
>
> Key: SPARK-20700
> URL: https://issues.apache.org/jira/browse/SPARK-20700
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>
> The following (complicated) query eventually fails with a stack overflow 
> during optimization:
> {code}
> CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
> int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
>   ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
> TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
>   ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
> TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
>   ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
> TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
>   ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
> '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
>   ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
> STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
>   ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
> CAST(NULL AS TIMESTAMP), '-740'),
>   ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
> AS TIMESTAMP), CAST(NULL AS STRING)),
>   ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
> CAST(NULL AS TIMESTAMP), '181'),
>   ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
> TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
>   ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
> STRING), CAST(NULL AS TIMESTAMP), '-62');
> CR

[jira] [Issue Comment Deleted] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)

2017-05-12 Thread Jiang Xingbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-20700:
-
Comment: was deleted

(was: I couldn't reproduce the failure on current master branch, the test case 
I use is like the following:
{code}
test("SPARK-20700: InferFiltersFromConstraints stackoverflows for query") {
withTempView("table_5") {
  withView("bools") {
sql(
  """CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, 
float_col_3, int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
|  ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, 
'571', TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
|  ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, 
'-278', TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
|  ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
|  ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), 
CAST(NULL AS INT), '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
|  ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, 
CAST(NULL AS STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
|  ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, 
'330', CAST(NULL AS TIMESTAMP), '-740'),
|  ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, 
'-766', CAST(NULL AS TIMESTAMP), CAST(NULL AS STRING)),
|  ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, 
'-514', CAST(NULL AS TIMESTAMP), '181'),
|  ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
|  ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, 
CAST(NULL AS STRING), CAST(NULL AS TIMESTAMP), '-62')
  """.
stripMargin)
sql("CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null)")

sql(
  """
  SELECT
|AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 
PRECEDING ) AS float_col,
|COUNT(t1.smallint_col_2) AS int_col
|FROM table_5 t1
|INNER JOIN (
|SELECT
|(MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
(t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
AS boolean_col,
|t2.a,
|(t1.int_col_4) * (t1.int_col_4) AS int_col
|FROM table_5 t1
|LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
|WHERE
|(t1.smallint_col_2) > (t1.smallint_col_2)
|GROUP BY
|t2.a,
|(t1.int_col_4) * (t1.int_col_4)
|HAVING
|((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * 
(t1.int_col_4), SUM(t1.int_col_4))
|) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = 
(t1.int_col_4))) AND ((t2.a) = (t1.smallint_col_2))
""".stripMargin)
  }
}
  }
{code})

> InferFiltersFromConstraints stackoverflows for query (v2)
> -
>
> Key: SPARK-20700
> URL: https://issues.apache.org/jira/browse/SPARK-20700
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>
> The following (complicated) query eventually fails with a stack overflow 
> during optimization:
> {code}
> CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
> int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
>   ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
> TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
>   ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
> TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
>   ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
> TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
>   ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
> '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
>   ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
> STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
>   ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
> CAST(NULL AS TIMESTAMP), '-740'),
>   ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
> AS TIMESTAMP), CAST(NULL AS STRING)),
>   ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
> CAST(NULL AS TIMESTAMP), '181'),
>   ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
> TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
>   ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
> STRING), CAST(NULL AS TIMESTAMP), '-62');
> CREATE VIEW bools(a, b) as 

[jira] [Commented] (SPARK-20703) Add an operator for writing data out

2017-05-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007738#comment-16007738
 ] 

Reynold Xin commented on SPARK-20703:
-

That and also Hive. We can do them one by one though.


> Add an operator for writing data out
> 
>
> Key: SPARK-20703
> URL: https://issues.apache.org/jira/browse/SPARK-20703
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> We should add an operator for writing data out. Right now in the explain plan 
> / UI there is no way to tell whether a query is writing data out, and also 
> there is no way to associate metrics with data writes. It'd be tremendously 
> valuable to do this for adding metrics and for visibility.






[jira] [Commented] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for identical NaN feature

2017-05-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007739#comment-16007739
 ] 

Nick Pentreath commented on SPARK-20711:


Shouldn't the stats for any column that contains at least one {{NaN}} value be 
{{NaN}}?

> MultivariateOnlineSummarizer incorrect min/max for identical NaN feature
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.






[jira] [Commented] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for identical NaN feature

2017-05-12 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007748#comment-16007748
 ] 

zhengruifeng commented on SPARK-20711:
--

[~mlnick] It seems that in the current implementation {{min/max}} will ignore 
{{NaN}}:
{code}
scala> import org.apache.spark.mllib.stat._
import org.apache.spark.mllib.stat._

scala> val summarizer = new MultivariateOnlineSummarizer()
summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2d1f3639

scala> import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
res0: summarizer.type = 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2d1f3639

scala> summarizer.add(Vectors.dense(Double.NaN, Double.NaN))
res1: summarizer.type = 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2d1f3639

scala> summarizer.max
res2: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,-10.0]

scala> summarizer.mean
res3: org.apache.spark.mllib.linalg.Vector = [NaN,NaN]

scala> summarizer.count
res4: Long = 2

scala> summarizer.normL1
res5: org.apache.spark.mllib.linalg.Vector = [NaN,NaN]

scala> summarizer.normL2
res6: org.apache.spark.mllib.linalg.Vector = [NaN,NaN]

scala> summarizer.variance
res7: org.apache.spark.mllib.linalg.Vector = [NaN,NaN]
{code}
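
A plain-Scala sketch of why this happens (assumed mechanism, not the actual MultivariateOnlineSummarizer code): every comparison against NaN is false, so a running max seeded with Double.MinValue and a running min seeded with Double.MaxValue are never updated by NaN values.

{code}
// Note: in Scala, Double.MinValue is -1.7976931348623157E308 (the negative of
// MaxValue), which matches the "max" reported above for a NaN-only column.
var curMax = Double.MinValue
var curMin = Double.MaxValue
Seq(Double.NaN, Double.NaN).foreach { v =>
  if (v > curMax) curMax = v  // NaN > x is false, so this never fires
  if (v < curMin) curMin = v  // NaN < x is false, so neither does this
}
println((curMin, curMax))     // (1.7976931348623157E308, -1.7976931348623157E308)
{code}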

> MultivariateOnlineSummarizer incorrect min/max for identical NaN feature
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.






[jira] [Updated] (SPARK-20718) FileSourceScanExec with different filter orders should be the same after canonicalization

2017-05-12 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-20718:
-
Description: 
Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
Usually this is ok because of canonicalization. However, in 
`FileSourceScanExec`, its data filters and partition filters are sequences, and 
their orders are not canonicalized. So `def sameResult` returns different 
results for different orders of data/partition filters. This leads to, e.g., 
different decisions for `ReuseExchange`, and thus results in unstable 
performance.

The same issue exists in `HiveTableScanExec`.

  was:Since `constraints` in `QueryPlan` is a set, the order of filters can 
differ. Usually this is ok because of canonicalization. However, in 
`FileSourceScanExec`, its data filters and partition filters are sequences, and 
their orders are not canonicalized. So `def sameResult` returns different 
results for different orders of data/partition filters. This leads to, e.g., 
different decisions for `ReuseExchange`, and thus results in unstable 
performance.


> FileSourceScanExec with different filter orders should be the same after 
> canonicalization
> -
>
> Key: SPARK-20718
> URL: https://issues.apache.org/jira/browse/SPARK-20718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
> Usually this is ok because of canonicalization. However, in 
> `FileSourceScanExec`, its data filters and partition filters are sequences, 
> and their orders are not canonicalized. So `def sameResult` returns different 
> results for different orders of data/partition filters. This leads to, e.g., 
> different decisions for `ReuseExchange`, and thus results in unstable 
> performance.
> The same issue exists in `HiveTableScanExec`.






[jira] [Created] (SPARK-20721) Exception thrown when doing spark submit

2017-05-12 Thread Sunil Sharma (JIRA)
Sunil Sharma created SPARK-20721:


 Summary: Exception thrown when doing spark submit
 Key: SPARK-20721
 URL: https://issues.apache.org/jira/browse/SPARK-20721
 Project: Spark
  Issue Type: Request
  Components: Deploy, DStreams
Affects Versions: 2.0.0
 Environment: Ubuntu
spark version : 2.0.0
kafka version  : kafka_2.11-0.8.2.2
Reporter: Sunil Sharma









[jira] [Commented] (SPARK-20721) Exception thrown when doing spark submit

2017-05-12 Thread Sunil Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007768#comment-16007768
 ] 

Sunil Sharma commented on SPARK-20721:
--

!http://www.host.com/image.gif!
or
!/home/sunilsharma991/Pictures/ApacheSpark.png!



> Exception thrown when doing spark submit
> 
>
> Key: SPARK-20721
> URL: https://issues.apache.org/jira/browse/SPARK-20721
> Project: Spark
>  Issue Type: Request
>  Components: Deploy, DStreams
>Affects Versions: 2.0.0
> Environment: Ubuntu
> spark version : 2.0.0
> kafka version  : kafka_2.11-0.8.2.2
>Reporter: Sunil Sharma
>  Labels: newbie
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>







[jira] [Updated] (SPARK-20721) Exception thrown when doing spark submit

2017-05-12 Thread Sunil Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Sharma updated SPARK-20721:
-
Attachment: error.odt

Refer to the attached doc for a better understanding of the issue.

> Exception thrown when doing spark submit
> 
>
> Key: SPARK-20721
> URL: https://issues.apache.org/jira/browse/SPARK-20721
> Project: Spark
>  Issue Type: Request
>  Components: Deploy, DStreams
>Affects Versions: 2.0.0
> Environment: Ubuntu
> spark version : 2.0.0
> kafka version  : kafka_2.11-0.8.2.2
>Reporter: Sunil Sharma
>  Labels: newbie
> Attachments: error.odt
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>







[jira] [Updated] (SPARK-18772) Unnecessary conversion try and some missing cases for special floats in JSON

2017-05-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18772:
-
Summary: Unnecessary conversion try and some missing cases for special 
floats in JSON  (was: Parsing JSON with some NaN and Infinity values throws 
NumberFormatException)

> Unnecessary conversion try and some missing cases for special floats in JSON
> 
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> JacksonParser tests for infinite and NaN values in a way that is not 
> supported by the underlying float/double parser. For example, the input 
> string is always lowercased to check for {{-Infinity}} but the parser only 
> supports titlecased values. So a {{-infinitY}} will pass the test but fail 
> with a {{NumberFormatException}} when parsing. This exception is not caught 
> anywhere and the task ends up failing.
> A related issue is that the code checks for {{Inf}} but the parser only 
> supports the long form of {{Infinity}}.






[jira] [Commented] (SPARK-18772) Unnecessary conversion try and some missing cases for special floats in JSON

2017-05-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007779#comment-16007779
 ] 

Hyukjin Kwon commented on SPARK-18772:
--

I am sorry for fixing up the JIRA myself, [~NathanHowell], as suggested in the 
PR. Please excuse me.

> Unnecessary conversion try and some missing cases for special floats in JSON
> 
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> It looks like we can avoid some unnecessary conversion attempts for special 
> floats in JSON.
> Also, we could support some other spellings for them, such as {{+INF}}, {{INF}} 
> and {{-INF}}.
> Regarding the unnecessary conversion attempts, please refer to the code below:
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> spark.read.schema(StructType(Seq(StructField("a", 
> DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
> "nan"}""").toDS).show()
> 17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NumberFormatException: For input string: "nan"
> ...
> {code}
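
A hedged sketch of the kind of handling being discussed (not Spark's actual JacksonParser code): match the supported special spellings explicitly and leave everything else, including lowercase variants like "nan", to the regular number parser.

{code}
def parseSpecialFloat(s: String): Option[Double] = s.trim match {
  case "NaN"                       => Some(Double.NaN)
  case "Infinity" | "INF" | "+INF" => Some(Double.PositiveInfinity)
  case "-Infinity" | "-INF"        => Some(Double.NegativeInfinity)
  case _                           => None  // fall back to the normal number parser
}
{code}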






[jira] [Updated] (SPARK-18772) Unnecessary conversion try and some missing cases for special floats in JSON

2017-05-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18772:
-
Description: 
It looks like we can avoid some unnecessary conversion attempts for special 
floats in JSON.

Also, we could support some other spellings for them, such as {{+INF}}, {{INF}} and 
{{-INF}}.

Regarding the unnecessary conversion attempts, please refer to the code below:

{code}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", 
DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
"nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
{code}




  was:
JacksonParser tests for infinite and NaN values in a way that is not supported 
by the underlying float/double parser. For example, the input string is always 
lowercased to check for {{-Infinity}} but the parser only supports titlecased 
values. So a {{-infinitY}} will pass the test but fail with a 
{{NumberFormatException}} when parsing. This exception is not caught anywhere 
and the task ends up failing.
A related issue is that the code checks for {{Inf}} but the parser only 
supports the long form of {{Infinity}}.


> Unnecessary conversion try and some missing cases for special floats in JSON
> 
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> It looks like we can avoid some unnecessary conversion attempts for special 
> floats in JSON.
> Also, we could support some other spellings for them, such as {{+INF}}, {{INF}} 
> and {{-INF}}.
> Regarding the unnecessary conversion attempts, please refer to the code below:
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> spark.read.schema(StructType(Seq(StructField("a", 
> DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": 
> "nan"}""").toDS).show()
> 17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NumberFormatException: For input string: "nan"
> ...
> {code}






[jira] [Created] (SPARK-20722) Replay event log that hasn't be replayed in current checking period in advance for request

2017-05-12 Thread sharkd tu (JIRA)
sharkd tu created SPARK-20722:
-

 Summary: Replay event log that hasn't be replayed in current 
checking period in advance for request
 Key: SPARK-20722
 URL: https://issues.apache.org/jira/browse/SPARK-20722
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: sharkd tu


The history server may replay logs slowly if the size of the event logs in the 
current checking period is very large. It will get stuck for a while before 
entering the next checking period; if we request a newer application's history 
UI during that time, we get an error like "Application 
application_1481785469354_934016 not found". We can let the history server 
replay the newer event log in advance for such a request.






[jira] [Updated] (SPARK-20722) Replay event log that hasn't be replayed in current checking period in advance for request

2017-05-12 Thread sharkd tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sharkd tu updated SPARK-20722:
--
Attachment: history-server2.png
history-server1.png

> Replay event log that hasn't be replayed in current checking period in 
> advance for request
> --
>
> Key: SPARK-20722
> URL: https://issues.apache.org/jira/browse/SPARK-20722
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
> Attachments: history-server1.png, history-server2.png
>
>
> History server may replay logs slowly if the size of event logs in current 
> checking period is very large.  It will get stuck for a while before entering 
> next  checking period, if we request a newer application history ui, we get 
> the error like "Application application_1481785469354_934016 not found". We 
> can let history server replay the newer event log in advance for request.






[jira] [Commented] (SPARK-20722) Replay event log that hasn't be replayed in current checking period in advance for request

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007791#comment-16007791
 ] 

Apache Spark commented on SPARK-20722:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17963

> Replay event log that hasn't be replayed in current checking period in 
> advance for request
> --
>
> Key: SPARK-20722
> URL: https://issues.apache.org/jira/browse/SPARK-20722
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
> Attachments: history-server1.png, history-server2.png
>
>
> History server may replay logs slowly if the size of event logs in current 
> checking period is very large.  It will get stuck for a while before entering 
> next  checking period, if we request a newer application history ui, we get 
> the error like "Application application_1481785469354_934016 not found". We 
> can let history server replay the newer event log in advance for request.






[jira] [Assigned] (SPARK-20722) Replay event log that hasn't be replayed in current checking period in advance for request

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20722:


Assignee: (was: Apache Spark)

> Replay event log that hasn't be replayed in current checking period in 
> advance for request
> --
>
> Key: SPARK-20722
> URL: https://issues.apache.org/jira/browse/SPARK-20722
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
> Attachments: history-server1.png, history-server2.png
>
>
> History server may replay logs slowly if the size of event logs in current 
> checking period is very large.  It will get stuck for a while before entering 
> next  checking period, if we request a newer application history ui, we get 
> the error like "Application application_1481785469354_934016 not found". We 
> can let history server replay the newer event log in advance for request.






[jira] [Closed] (SPARK-20721) Exception thrown when doing spark submit

2017-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-20721.
-

> Exception thrown when doing spark submit
> 
>
> Key: SPARK-20721
> URL: https://issues.apache.org/jira/browse/SPARK-20721
> Project: Spark
>  Issue Type: Request
>  Components: Deploy, DStreams
>Affects Versions: 2.0.0
> Environment: Ubuntu
> spark version : 2.0.0
> kafka version  : kafka_2.11-0.8.2.2
>Reporter: Sunil Sharma
>  Labels: newbie
> Attachments: error.odt
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>







[jira] [Assigned] (SPARK-20722) Replay event log that hasn't be replayed in current checking period in advance for request

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20722:


Assignee: Apache Spark

> Replay event log that hasn't be replayed in current checking period in 
> advance for request
> --
>
> Key: SPARK-20722
> URL: https://issues.apache.org/jira/browse/SPARK-20722
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Assignee: Apache Spark
> Attachments: history-server1.png, history-server2.png
>
>
> History server may replay logs slowly if the size of event logs in current 
> checking period is very large.  It will get stuck for a while before entering 
> next  checking period, if we request a newer application history ui, we get 
> the error like "Application application_1481785469354_934016 not found". We 
> can let history server replay the newer event log in advance for request.






[jira] [Resolved] (SPARK-20721) Exception thrown when doing spark submit

2017-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20721.
---
Resolution: Invalid

This is not at all appropriate for JIRA

> Exception thrown when doing spark submit
> 
>
> Key: SPARK-20721
> URL: https://issues.apache.org/jira/browse/SPARK-20721
> Project: Spark
>  Issue Type: Request
>  Components: Deploy, DStreams
>Affects Versions: 2.0.0
> Environment: Ubuntu
> spark version : 2.0.0
> kafka version  : kafka_2.11-0.8.2.2
>Reporter: Sunil Sharma
>  Labels: newbie
> Attachments: error.odt
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>







[jira] [Resolved] (SPARK-20639) Add single argument support for to_timestamp in SQL

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20639.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17901
[https://github.com/apache/spark/pull/17901]

> Add single argument support for to_timestamp in SQL
> ---
>
> Key: SPARK-20639
> URL: https://issues.apache.org/jira/browse/SPARK-20639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, it looks like we can omit the timestamp format, as below:
> {code}
> import org.apache.spark.sql.functions._
> Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
> {code}
> {code}
> +----------------------------------------+
> |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
> +----------------------------------------+
> |                     2016-12-31 00:12:00|
> +----------------------------------------+
> {code}
> whereas this does not work in SQL as below:
> {code}
> spark-sql> SELECT to_timestamp('2016-12-31 00:12:00.00');
> Error in query: Invalid number of arguments for function to_timestamp; line 1 
> pos 7
> {code}
> It looks like we could support this too. For {{to_date}}, it already works 
> in SQL as well as in the other language APIs.
> {code}
> scala> Seq("2016-12-31").toDF("a").select(to_date(col("a"))).show()
> +----------+
> |to_date(a)|
> +----------+
> |2016-12-31|
> +----------+
> {code}
> {code}
> spark-sql> SELECT to_date('2016-12-31');
> 2016-12-31
> {code}






[jira] [Assigned] (SPARK-20639) Add single argument support for to_timestamp in SQL

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20639:
---

Assignee: Hyukjin Kwon

> Add single argument support for to_timestamp in SQL
> ---
>
> Key: SPARK-20639
> URL: https://issues.apache.org/jira/browse/SPARK-20639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, it looks like we can omit the timestamp format, as below:
> {code}
> import org.apache.spark.sql.functions._
> Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
> {code}
> {code}
> +----------------------------------------+
> |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
> +----------------------------------------+
> |                     2016-12-31 00:12:00|
> +----------------------------------------+
> {code}
> whereas this does not work in SQL as below:
> {code}
> spark-sql> SELECT to_timestamp('2016-12-31 00:12:00.00');
> Error in query: Invalid number of arguments for function to_timestamp; line 1 
> pos 7
> {code}
> It looks like we could support this too. For {{to_date}}, it already works 
> in SQL as well as in the other language APIs.
> {code}
> scala> Seq("2016-12-31").toDF("a").select(to_date(col("a"))).show()
> +----------+
> |to_date(a)|
> +----------+
> |2016-12-31|
> +----------+
> {code}
> {code}
> spark-sql> SELECT to_date('2016-12-31');
> 2016-12-31
> {code}






[jira] [Updated] (SPARK-20720) 'Executor Summary' should show the exact number, 'Removed Executors' should display the specific number, in the Application Page

2017-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20720:
--
Affects Version/s: (was: 2.2.1)
   (was: 2.3.0)
   (was: 2.2.0)
 Priority: Trivial  (was: Minor)

> 'Executor Summary' should show the exact number, 'Removed Executors' should 
> display the specific number, in the Application Page
> 
>
> Key: SPARK-20720
> URL: https://issues.apache.org/jira/browse/SPARK-20720
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1
>Reporter: guoxiaolongzte
>Priority: Trivial
> Attachments: executor.png
>
>
> When the number of Spark worker executors is large, displaying the specific 
> number will better help us analyze and observe via the Spark UI. 
> Although this is a small improvement, it is indeed very valuable.






[jira] [Assigned] (SPARK-20554) Remove usage of scala.language.reflectiveCalls

2017-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20554:
-

   Assignee: Sean Owen
   Priority: Trivial  (was: Minor)
Description: 
In several parts of the code we have imported 
{{scala.language.reflectiveCalls}} to suppress a warning about, well, 
reflective calls. I know from cleaning up build warnings in 2.2 that some 
cases of this are inadvertent and mask a type problem.

Example, in HiveDDLSuite:

{code}
val expectedTablePath =
  if (dbPath.isEmpty) {
hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
  } else {
new Path(new Path(dbPath.get), tableIdentifier.table)
  }
val filesystemPath = new Path(expectedTablePath.toString)
{code}

This shouldn't really work because one branch returns a URI and the other a 
Path. In this case it only needs an object with a toString method and can make 
this work with structural types and reflection.

Obviously, the intent was to add ".toURI" to the second branch though to make 
both a URI!

I think we should probably clean this up by taking out all imports of 
reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
legit usages.


  was:
In several parts of the code we have imported 
{{scala.language.reflectiveCalls}} to suppress a warning about, well, 
reflective calls. I know from cleaning up build warnings in 2.2 that in almost 
all cases of this are inadvertent and masking a type problem.

Example, in HiveDDLSuite:

{code}
val expectedTablePath =
  if (dbPath.isEmpty) {
hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
  } else {
new Path(new Path(dbPath.get), tableIdentifier.table)
  }
val filesystemPath = new Path(expectedTablePath.toString)
{code}

This shouldn't really work because one branch returns a URI and the other a 
Path. In this case it only needs an object with a toString method and can make 
this work with structural types and reflection.

Obviously, the intent was to add ".toURI" to the second branch though to make 
both a URI!

I think we should probably clean this up by taking out all imports of 
reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
legit usages.



> Remove usage of scala.language.reflectiveCalls
> --
>
> Key: SPARK-20554
> URL: https://issues.apache.org/jira/browse/SPARK-20554
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In several parts of the code we have imported 
> {{scala.language.reflectiveCalls}} to suppress a warning about, well, 
> reflective calls. I know from cleaning up build warnings in 2.2 that some 
> cases of this are inadvertent and mask a type problem.
> Example, in HiveDDLSuite:
> {code}
> val expectedTablePath =
>   if (dbPath.isEmpty) {
> hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
>   } else {
> new Path(new Path(dbPath.get), tableIdentifier.table)
>   }
> val filesystemPath = new Path(expectedTablePath.toString)
> {code}
> This shouldn't really work because one branch returns a URI and the other a 
> Path. In this case it only needs an object with a toString method and can 
> make this work with structural types and reflection.
> Obviously, the intent was to add ".toURI" to the second branch though to make 
> both a URI!
> I think we should probably clean this up by taking out all imports of 
> reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
> legit usages.
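
A self-contained illustration of the mechanism described above (a generic example, not the HiveDDLSuite code): two unrelated classes that merely share a method signature get a structural least-upper-bound type, and calling the shared member compiles only because of reflectiveCalls, going through reflection at runtime. That is how a genuine type mismatch between two branches can be silently papered over.

{code}
import scala.language.reflectiveCalls
import scala.util.Random

class A { def describe(): String = "A" }
class B { def describe(): String = "B" }

// The inferred type of `x` is the structural type AnyRef { def describe(): String }
val x = if (Random.nextBoolean()) new A else new B
println(x.describe())  // a reflective call; without the import this is a compiler warning
{code}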






[jira] [Resolved] (SPARK-20554) Remove usage of scala.language.reflectiveCalls

2017-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20554.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17949
[https://github.com/apache/spark/pull/17949]

> Remove usage of scala.language.reflectiveCalls
> --
>
> Key: SPARK-20554
> URL: https://issues.apache.org/jira/browse/SPARK-20554
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Priority: Minor
> Fix For: 2.2.0
>
>
> In several parts of the code we have imported 
> {{scala.language.reflectiveCalls}} to suppress a warning about, well, 
> reflective calls. I know from cleaning up build warnings in 2.2 that in 
> almost all cases of this are inadvertent and masking a type problem.
> Example, in HiveDDLSuite:
> {code}
> val expectedTablePath =
>   if (dbPath.isEmpty) {
> hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
>   } else {
> new Path(new Path(dbPath.get), tableIdentifier.table)
>   }
> val filesystemPath = new Path(expectedTablePath.toString)
> {code}
> This shouldn't really work because one branch returns a URI and the other a 
> Path. In this case it only needs an object with a toString method and can 
> make this work with structural types and reflection.
> Obviously, the intent was to add ".toURI" to the second branch though to make 
> both a URI!
> I think we should probably clean this up by taking out all imports of 
> reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
> legit usages.






[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-12 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007823#comment-16007823
 ] 

Wenchen Fan commented on SPARK-19122:
-

I tried the example but can't reproduce this issue; there is no shuffle in the 
second query. I checked `HashPartitioning.satisfies`, and it doesn't consider 
the expression order.

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if join predicates are specified in query in *same* order as bucketing 
> and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with join predicates in *different* order from bucketing and 
> sort order leads to extra shuffle and sort being introduced
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}






[jira] [Updated] (SPARK-20624) Add better handling for node shutdown

2017-05-12 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-20624:

Summary: Add better handling for node shutdown  (was: Consider adding 
better handling for node shutdown)

> Add better handling for node shutdown
> -
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>Priority: Minor
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).






[jira] [Created] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-12 Thread madhukara phatak (JIRA)
madhukara phatak created SPARK-20723:


 Summary: Random Forest Classifier should expose 
intermediateRDDStorageLevel similar to ALS
 Key: SPARK-20723
 URL: https://issues.apache.org/jira/browse/SPARK-20723
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.0
Reporter: madhukara phatak
Priority: Minor


Currently the Random Forest implementation caches its intermediate data using the 
*MEMORY_AND_DISK* storage level. This creates issues in low-memory scenarios. 
So we should expose an expert param *intermediateRDDStorageLevel* which allows 
users to customise the storage level. This is similar to the ALS options 
specified in the jira below:

https://issues.apache.org/jira/browse/SPARK-14412
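
For reference, the ALS knob this proposal mirrors already exists (added under the jira linked above); the analogous setter on RandomForestClassifier would be the proposed, not-yet-existing change.

{code}
import org.apache.spark.ml.recommendation.ALS

// ALS already exposes the intermediate storage level as an expert param:
val als = new ALS()
  .setIntermediateStorageLevel("MEMORY_ONLY")  // default is "MEMORY_AND_DISK"
{code}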






[jira] [Commented] (SPARK-20706) Spark-shell not overriding method definition

2017-05-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007974#comment-16007974
 ] 

Hyukjin Kwon commented on SPARK-20706:
--

(I can't reproduce this in Scala 2.11.6 either.)

> Spark-shell not overriding method definition
> 
>
> Key: SPARK-20706
> URL: https://issues.apache.org/jira/browse/SPARK-20706
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: Linux, Scala 2.11.8
>Reporter: Raphael Roth
>Priority: Minor
>
> In the following example, the definition of myMethod is not correctly updated:
> --
> def myMethod()  = "first definition"
> val tmp = myMethod(); val out = tmp
> println(out) // prints "first definition"
> def myMethod()  = "second definition" // override above myMethod
> val tmp = myMethod(); val out = tmp 
> println(out) // should be "second definition" but is "first definition"
> --
> I'm using a semicolon to force two statements to be compiled at the same time. 
> It's also possible to reproduce the behavior using :paste.
> So if I re-define myMethod, the implementation seems not to be updated in this 
> case. I figured out that the second-to-last statement (val out = tmp) causes 
> this behavior; if it is moved into a separate block, the code works just fine.






[jira] [Created] (SPARK-20724) spark-submit verbose mode should list default settings values

2017-05-12 Thread Michel Lemay (JIRA)
Michel Lemay created SPARK-20724:


 Summary: spark-submit verbose mode should list default settings 
values
 Key: SPARK-20724
 URL: https://issues.apache.org/jira/browse/SPARK-20724
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, SQL
Affects Versions: 2.1.0
Reporter: Michel Lemay
Priority: Minor


When debugging an application, we must at times be able to see the default 
values for some configuration, but there is, to my knowledge, no means to do so. 
spark-submit --verbose does not show default values, only modified ones. I don't 
want to go to github for a specific tag/version of the code to guess what a 
default value might be. We need to be able to see every bit of the settings, 
including default values.

See the related jira about SQLConf.
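
Not a fix for the spark-submit behaviour described above, but for SQL configurations specifically (which the related SQLConf jira presumably targets), keys, current values and documentation can already be dumped from a running session; a small sketch:

{code}
// Run inside spark-shell or any application with a SparkSession named `spark`:
spark.sql("SET -v").show(numRows = 1000, truncate = false)
{code}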






[jira] [Commented] (SPARK-18990) make DatasetBenchmark fairer for Dataset

2017-05-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008009#comment-16008009
 ] 

Takeshi Yamamuro commented on SPARK-18990:
--

cc: [~cloud_fan] Since the pr above has been merged, could we close this?

> make DatasetBenchmark fairer for Dataset
> 
>
> Key: SPARK-18990
> URL: https://issues.apache.org/jira/browse/SPARK-18990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-20706) Spark-shell not overriding method definition

2017-05-12 Thread Raphael Roth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008010#comment-16008010
 ] 

Raphael Roth commented on SPARK-20706:
--

also tested with scala console 2.11.8, this works fine. So I assume the bug is 
in spark-shell itself

> Spark-shell not overriding method definition
> 
>
> Key: SPARK-20706
> URL: https://issues.apache.org/jira/browse/SPARK-20706
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: Linux, Scala 2.11.8
>Reporter: Raphael Roth
>Priority: Minor
>
> In the following example, the definition of myMethod is not correctly updated:
> --
> def myMethod()  = "first definition"
> val tmp = myMethod(); val out = tmp
> println(out) // prints "first definition"
> def myMethod()  = "second definition" // override above myMethod
> val tmp = myMethod(); val out = tmp 
> println(out) // should be "second definition" but is "first definition"
> --
> I'm using semicolon to force two statements to be compiled at the same time. 
> It's also possible to reproduce the behavior using :paste
> So if I-redefine myMethod, the implementation seems not to be updated in this 
> case. I figured out that the second-last statement (val out = tmp) causes 
> this behavior, if this is moved in a separate block, the code works just fine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18990) make DatasetBenchmark fairer for Dataset

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18990.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> make DatasetBenchmark fairer for Dataset
> 
>
> Key: SPARK-18990
> URL: https://issues.apache.org/jira/browse/SPARK-18990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17424.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0
   2.0.3

Issue resolved by pull request 15062
[https://github.com/apache/spark/pull/15062]

> Dataset job fails from unsound substitution in ScalaReflect
> ---
>
> Key: SPARK-17424
> URL: https://issues.apache.org/jira/browse/SPARK-17424
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Ryan Blue
> Fix For: 2.0.3, 2.2.0, 2.1.2
>
>
> I have a job that uses datasets in 1.6.1 and is failing with this error:
> {code}
> 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: 
> java.lang.AssertionError: assertion failed: Unsound substitution from 
> List(type T, type U) to List()
> java.lang.AssertionError: assertion failed: Unsound substitution from 
> List(type T, type U) to List()
> at scala.reflect.internal.Types$SubstMap.(Types.scala:4644)
> at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761)
> at scala.reflect.internal.Types$Type.subst(Types.scala:796)
> at 
> scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321)
> at 
> scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279)
> at scala.collection.AbstractIterator.toMap(Iterator.scala:1157)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
> at 
> org.apache

[jira] [Assigned] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-17424:
---

Assignee: Ryan Blue

> Dataset job fails from unsound substitution in ScalaReflect
> ---
>
> Key: SPARK-17424
> URL: https://issues.apache.org/jira/browse/SPARK-17424
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> I have a job that uses datasets in 1.6.1 and is failing with this error:
> {code}
> 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: 
> java.lang.AssertionError: assertion failed: Unsound substitution from 
> List(type T, type U) to List()
> java.lang.AssertionError: assertion failed: Unsound substitution from 
> List(type T, type U) to List()
> at scala.reflect.internal.Types$SubstMap.(Types.scala:4644)
> at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761)
> at scala.reflect.internal.Types$Type.subst(Types.scala:796)
> at 
> scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321)
> at 
> scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279)
> at scala.collection.AbstractIterator.toMap(Iterator.scala:1157)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
> at 
> org.apache.spark.sql.execution.Sp

[jira] [Assigned] (SPARK-20710) Support aliases in CUBE/ROLLUP/GROUPING SETS

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20710:
---

Assignee: Takeshi Yamamuro

> Support aliases in CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-20710
> URL: https://issues.apache.org/jira/browse/SPARK-20710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> The current master supports regular group-by aliases though, it does not 
> support for CUBE/ROLLUP/GROUPING SETS.
> sql("""
>   CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
>   (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)
>   AS testData(a, b)
> """)
> sql("SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k 
> GROUPING SETS(k)").show
> scala> sql("SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k 
> GROUPING SETS(k)").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`k`' given input 
> columns: [a, b]; line 1 pos 79;
> 'GroupingSets [ArrayBuffer('k)], [(a#61 + b#62), 'k], [(a#61 + b#62), b#62, 
> sum(cast((a#61 - b#62) as bigint))]
> +- SubqueryAlias testdata
>+- Project [a#61, b#62]
>   +- SubqueryAlias testData
>  +- LocalRelation [a#61, b#62]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scal



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20710) Support aliases in CUBE/ROLLUP/GROUPING SETS

2017-05-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20710.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17948
[https://github.com/apache/spark/pull/17948]

> Support aliases in CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-20710
> URL: https://issues.apache.org/jira/browse/SPARK-20710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> The current master supports regular group-by aliases though, it does not 
> support for CUBE/ROLLUP/GROUPING SETS.
> sql("""
>   CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
>   (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)
>   AS testData(a, b)
> """)
> sql("SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k 
> GROUPING SETS(k)").show
> scala> sql("SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k 
> GROUPING SETS(k)").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`k`' given input 
> columns: [a, b]; line 1 pos 79;
> 'GroupingSets [ArrayBuffer('k)], [(a#61 + b#62), 'k], [(a#61 + b#62), b#62, 
> sum(cast((a#61 - b#62) as bigint))]
> +- SubqueryAlias testdata
>+- Project [a#61, b#62]
>   +- SubqueryAlias testData
>  +- LocalRelation [a#61, b#62]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scal



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20472) Support for Dynamic Configuration

2017-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20472.
---
Resolution: Not A Problem

> Support for Dynamic Configuration
> -
>
> Key: SPARK-20472
> URL: https://issues.apache.org/jira/browse/SPARK-20472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.0
>Reporter: Shahbaz Hussain
>
> Currently Spark Configuration can not be dynamically changed.
> It requires Spark Job be killed and started again for a new configuration to 
> take in to effect.
> This bug is to enhance Spark ,such that configuration changes can be 
> dynamically changed without requiring a application restart.
> Ex: If Batch Interval in a Streaming Job is 20 seconds ,and if user wants to 
> reduce it to 5 seconds,currently it requires a re-submit of the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19668) Multiple NGram sizes

2017-05-12 Thread Zhe Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008113#comment-16008113
 ] 

Zhe Sun commented on SPARK-19668:
-

Is there any progress on this issue? [~mlnick]

If nobody pick it up, I can implement it, and pay extra attention on backward 
compat for save/load.



> Multiple NGram sizes
> 
>
> Key: SPARK-19668
> URL: https://issues.apache.org/jira/browse/SPARK-19668
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Jacek KK
>Priority: Minor
>  Labels: beginner, easyfix, newbie
>
> It would be nice to have a possibility of specyfing the range (or maybe a 
> list of) sizes of ngrams, like it is done in sklearn, as in 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
> This shouldn't be difficult to add, the code is very straightforward, and I 
> can implement it. The only issue is with the NGram API - should it just 
> accept a number/tuple/list?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-05-12 Thread Biagio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008163#comment-16008163
 ] 

Biagio commented on SPARK-18492:


Same Error as Rupinder and sskadarkar when using the "window" function with 
lower value of the parameter "slideDuration".

Is there any workaround for this issue? 

I'm wondering if this is an issue related to the 
org.apache.spark.sql.functions.window but like [~nchammas] pointed out, It 
seems that there are multiple issues related to this error, so I'm guessing 
it's not.


> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12

[jira] [Commented] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008222#comment-16008222
 ] 

Apache Spark commented on SPARK-20725:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17964

> partial aggregate should behave correctly for sameResult
> 
>
> Key: SPARK-20725
> URL: https://issues.apache.org/jira/browse/SPARK-20725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20725:


Assignee: Wenchen Fan  (was: Apache Spark)

> partial aggregate should behave correctly for sameResult
> 
>
> Key: SPARK-20725
> URL: https://issues.apache.org/jira/browse/SPARK-20725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20725:


Assignee: Apache Spark  (was: Wenchen Fan)

> partial aggregate should behave correctly for sameResult
> 
>
> Key: SPARK-20725
> URL: https://issues.apache.org/jira/browse/SPARK-20725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20725) partial aggregate should behave correctly for sameResult

2017-05-12 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20725:
---

 Summary: partial aggregate should behave correctly for sameResult
 Key: SPARK-20725
 URL: https://issues.apache.org/jira/browse/SPARK-20725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-12 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008266#comment-16008266
 ] 

Tejas Patil commented on SPARK-19122:
-

[~cloud_fan]: 
- The test case in the [associated PR 
(#16985)|https://github.com/apache/spark/pull/16985] fails without the fix 
- I am able to repro this issue over master branch. Can you share exact steps 
that you used to repro ? I am guessing that 
`spark.sql.autoBroadcastJoinThreshold` needs to be overridden otherwise it wont 
pick sort merge join. Here are my exact steps to repro the example in the jira 
description:

{noformat}
$ git log
commit 92ea7fd7b6cd4641b2f02b97105835029ddadc5f
Author: Takeshi Yamamuro 
Date:   Fri May 12 20:48:30 2017 +0800

build/sbt -Pyarn -Phadoop-2.4 -Phive package assembly/package
export SPARK_PREPEND_CLASSES=true
SPARK_LOCAL_IP=127.0.0.1 ./bin/spark-shell
{noformat}

In spark shell:
{noformat}
import org.apache.spark.sql._
val hc = SparkSession.builder.master("local").enableHiveSupport.getOrCreate()
hc.sql(" DROP TABLE table1 ")
hc.sql(" DROP TABLE table2 ")
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)

hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")


scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, (('a.j = 'b.j) && ('a.k = 'b.k))
   :- 'SubqueryAlias a
   :  +- 'UnresolvedRelation `table1`
   +- 'SubqueryAlias b
  +- 'UnresolvedRelation `table2`

== Analyzed Logical Plan ==
i: int, j: int, k: string, i: int, j: int, k: string
Project [i#86, j#87, k#88, i#89, j#90, k#91]
+- Join Inner, ((j#87 = j#90) && (k#88 = k#91))
   :- SubqueryAlias a
   :  +- SubqueryAlias table1
   : +- Relation[i#86,j#87,k#88] orc
   +- SubqueryAlias b
  +- SubqueryAlias table2
 +- Relation[i#89,j#90,k#91] orc

== Optimized Logical Plan ==
Join Inner, ((j#87 = j#90) && (k#88 = k#91))
:- Filter (isnotnull(j#87) && isnotnull(k#88))
:  +- Relation[i#86,j#87,k#88] orc
+- Filter (isnotnull(j#90) && isnotnull(k#91))
   +- Relation[i#89,j#90,k#91] orc

== Physical Plan ==
*SortMergeJoin [j#87, k#88], [j#90, k#91], Inner
:- *Project [i#86, j#87, k#88]
:  +- *Filter (isnotnull(j#87) && isnotnull(k#88))
: +- *FileScan orc default.table1[i#86,j#87,k#88] Batched: false, Format: 
ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/table1],
 PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
+- *Project [i#89, j#90, k#91]
   +- *Filter (isnotnull(j#90) && isnotnull(k#91))
  +- *FileScan orc default.table2[i#89,j#90,k#91] Batched: false, Format: 
ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/table2],
 PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct



scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND 
a.j=b.j").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, (('a.k = 'b.k) && ('a.j = 'b.j))
   :- 'SubqueryAlias a
   :  +- 'UnresolvedRelation `table1`
   +- 'SubqueryAlias b
  +- 'UnresolvedRelation `table2`

== Analyzed Logical Plan ==
i: int, j: int, k: string, i: int, j: int, k: string
Project [i#106, j#107, k#108, i#109, j#110, k#111]
+- Join Inner, ((k#108 = k#111) && (j#107 = j#110))
   :- SubqueryAlias a
   :  +- SubqueryAlias table1
   : +- Relation[i#106,j#107,k#108] orc
   +- SubqueryAlias b
  +- SubqueryAlias table2
 +- Relation[i#109,j#110,k#111] orc

== Optimized Logical Plan ==
Join Inner, ((k#108 = k#111) && (j#107 = j#110))
:- Filter (isnotnull(j#107) && isnotnull(k#108))
:  +- Relation[i#106,j#107,k#108] orc
+- Filter (isnotnull(k#111) && isnotnull(j#110))
   +- Relation[i#109,j#110,k#111] orc

== Physical Plan ==
*SortMergeJoin [k#108, j#107], [k#111, j#110], Inner
:- *Sort [k#108 ASC NULLS FIRST, j#107 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#108, j#107, 200)
: +- *Project [i#106, j#107, k#108]
:+- *Filter (isnotnull(j#107) && isnotnull(k#108))
:   +- *FileScan orc default.table1[i#106,j#107,k#108] Batched: false, 
Format: ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/apache-hive-1.2.1-bin/warehouse/table1],
 PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
+- *Sort [k#111 ASC NULLS FIRST, j#110 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k#111, j#110, 200)
  +- *Project [i#109, j#110, k#111]
 +- *Filter (isnotnull(k#111) && isnotnull(j#110))
+- *FileScan orc defa

[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2017-05-12 Thread Guangyang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008284#comment-16008284
 ] 

Guangyang Li commented on SPARK-16534:
--

I also hope there could be Python support for Kafka 0.10. The main issue is 
that Spark has the support with Kafka 0.8 and we are using it for production. 
While I really don't see the point that Spark stops it from updating to 0.10.

> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)

2017-05-12 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008286#comment-16008286
 ] 

Jiang Xingbo commented on SPARK-20700:
--

I've reproduced this case, will dive further into it this weekend.

> InferFiltersFromConstraints stackoverflows for query (v2)
> -
>
> Key: SPARK-20700
> URL: https://issues.apache.org/jira/browse/SPARK-20700
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>
> The following (complicated) query eventually fails with a stack overflow 
> during optimization:
> {code}
> CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
> int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
>   ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
> TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
>   ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
> TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
>   ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
> TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
>   ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
> '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
>   ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
> STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
>   ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
> CAST(NULL AS TIMESTAMP), '-740'),
>   ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
> AS TIMESTAMP), CAST(NULL AS STRING)),
>   ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
> CAST(NULL AS TIMESTAMP), '181'),
>   ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
> TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
>   ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
> STRING), CAST(NULL AS TIMESTAMP), '-62');
> CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null);
> SELECT
> AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS 
> float_col,
> COUNT(t1.smallint_col_2) AS int_col
> FROM table_5 t1
> INNER JOIN (
> SELECT
> (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
> (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
> AS boolean_col,
> t2.a,
> (t1.int_col_4) * (t1.int_col_4) AS int_col
> FROM table_5 t1
> LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
> WHERE
> (t1.smallint_col_2) > (t1.smallint_col_2)
> GROUP BY
> t2.a,
> (t1.int_col_4) * (t1.int_col_4)
> HAVING
> ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), 
> SUM(t1.int_col_4))
> ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND 
> ((t2.a) = (t1.smallint_col_2));
> {code}
> (I haven't tried to minimize this failing case yet).
> Based on sampled jstacks from the driver, it looks like the query might be 
> repeatedly inferring filters from constraints and then pruning those filters.
> Here's part of the stack at the point where it stackoverflows:
> {code}
> [... repeats ...]
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:344)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.Traversab

[jira] [Resolved] (SPARK-20704) CRAN test should run single threaded

2017-05-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20704.
--
  Resolution: Fixed
Assignee: Felix Cheung
   Fix Version/s: 2.3.0
  2.2.0
Target Version/s: 2.2.0, 2.3.0

> CRAN test should run single threaded
> 
>
> Key: SPARK-20704
> URL: https://issues.apache.org/jira/browse/SPARK-20704
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20726) R wrapper for SQL broadcast

2017-05-12 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20726:
--

 Summary: R wrapper for SQL broadcast
 Key: SPARK-20726
 URL: https://issues.apache.org/jira/browse/SPARK-20726
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz
Priority: Minor


Add R wrapper for {{o.a.s.sql.functions.broadcast}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20726) R wrapper for SQL broadcast

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008351#comment-16008351
 ] 

Apache Spark commented on SPARK-20726:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17965

> R wrapper for SQL broadcast
> ---
>
> Key: SPARK-20726
> URL: https://issues.apache.org/jira/browse/SPARK-20726
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Add R wrapper for {{o.a.s.sql.functions.broadcast}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20726) R wrapper for SQL broadcast

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20726:


Assignee: Apache Spark

> R wrapper for SQL broadcast
> ---
>
> Key: SPARK-20726
> URL: https://issues.apache.org/jira/browse/SPARK-20726
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> Add R wrapper for {{o.a.s.sql.functions.broadcast}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20726) R wrapper for SQL broadcast

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20726:


Assignee: (was: Apache Spark)

> R wrapper for SQL broadcast
> ---
>
> Key: SPARK-20726
> URL: https://issues.apache.org/jira/browse/SPARK-20726
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Add R wrapper for {{o.a.s.sql.functions.broadcast}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19951) Add string concatenate operator || to Spark SQL

2017-05-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19951.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.3.0

> Add string concatenate operator || to Spark SQL
> ---
>
> Key: SPARK-19951
> URL: https://issues.apache.org/jira/browse/SPARK-19951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> It is quite natural to concatenate strings using the {||} symbol. For 
> example: {{select a || b || c as abc from tbl_x}}. Let's add to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1902) Spark shell prints error when :4040 port already in use

2017-05-12 Thread dud (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008443#comment-16008443
 ] 

dud commented on SPARK-1902:


Hello

I'm also getting this long warning message on Spark 2.1.1.
I just copied log4j.properties.template to log4j.properties and this long 
stackstrace is now gone.
I don't know why this log4j configuration is not applied by default.

> Spark shell prints error when :4040 port already in use
> ---
>
> Key: SPARK-1902
> URL: https://issues.apache.org/jira/browse/SPARK-1902
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
> Fix For: 1.1.0
>
>
> When running two shells on the same machine, I get the below error.  The 
> issue is that the first shell takes port 4040, then the next tries tries 4040 
> and fails so falls back to 4041, then a third would try 4040 and 4041 before 
> landing on 4042, etc.
> We should catch the error and instead log as "Unable to use port 4041; 
> already in use.  Attempting port 4042..."
> {noformat}
> 14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED 
> SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already 
> in use
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:444)
> at sun.nio.ch.Net.bind(Net.java:436)
> at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
> at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
> at 
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
> at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at org.eclipse.jetty.server.Server.doStart(Server.java:293)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
> at org.apache.spark.SparkContext.(SparkContext.scala:217)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
> at $line3.$read$$iwC$$iwC.(:8)
> at $line3.$read$$iwC.(:14)
> at $line3.$read.(:16)
> at $line3.$read$.(:20)
> at $line3.$read$.()
> at $line3.$eval$.(:7)
> at $line3.$eval$.()
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)

[jira] [Resolved] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error

2017-05-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20702.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> TaskContextImpl.markTaskCompleted should not hide the original error
> 
>
> Key: SPARK-20702
> URL: https://issues.apache.org/jira/browse/SPARK-20702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> If a TaskCompletionListener throws an error, 
> TaskContextImpl.markTaskCompleted will hide the original error.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines

2017-05-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-20727:
-

 Summary: Skip SparkR tests when missing Hadoop winutils on CRAN 
windows machines
 Key: SPARK-20727
 URL: https://issues.apache.org/jira/browse/SPARK-20727
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.1, 2.2.0
Reporter: Shivaram Venkataraman


We should skips tests that use the Hadoop libraries while running
on CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are not available on the target
system. The skipped tests will consist of
1. Tests that save, load a model in MLlib
2. Tests that save, load CSV, JSON and Parquet files in SQL
3. Hive tests

Note that these tests will still be run on AppVeyor for every PR, so our 
overall test coverage should not go down



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20714) Fix match error when watermark is set with timeout = no timeout / processing timeout

2017-05-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20714.
--
Resolution: Fixed

> Fix match error when watermark is set with timeout = no timeout / processing 
> timeout
> 
>
> Key: SPARK-20714
> URL: https://issues.apache.org/jira/browse/SPARK-20714
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> When watermark is set, and timeout conf is NoTimeout or ProcessingTimeTimeout 
> (both do not need the watermark), the query fails at runtime with the 
> following exception.
> {code}
> MatchException: 
> Some(org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate@1a9b798e)
>  (of class scala.Some)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:120)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:116)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
> 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008470#comment-16008470
 ] 

Apache Spark commented on SPARK-20666:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/17966

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Critical
>
> seeing quite a bit of this on AppVeyor, aka Windows only,-> seems like in 
> other test runs too, always only when running ML tests, it seems
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scal

[jira] [Assigned] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20727:


Assignee: Apache Spark

> Skip SparkR tests when missing Hadoop winutils on CRAN windows machines
> ---
>
> Key: SPARK-20727
> URL: https://issues.apache.org/jira/browse/SPARK-20727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> We should skip tests that use the Hadoop libraries while running
> the CRAN check with Windows as the operating system. This is to handle
> cases where the Hadoop winutils binaries are not available on the target
> system. The skipped tests will consist of:
> 1. Tests that save and load a model in MLlib
> 2. Tests that save and load CSV, JSON and Parquet files in SQL
> 3. Hive tests
> Note that these tests will still be run on AppVeyor for every PR, so our
> overall test coverage should not go down.
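For reference, a minimal sketch (in Scala, purely for illustration -- the actual
SparkR skip logic lives in the R test sources) of the condition under which
these tests would be skipped, assuming the conventional
HADOOP_HOME/bin/winutils.exe layout:

{code}
import java.io.File

object HadoopTestSkip {
  // True when running on Windows without a usable winutils.exe under
  // HADOOP_HOME/bin, i.e. the case in which the Hadoop-dependent tests
  // listed above would be skipped.
  def shouldSkipHadoopDependentTests: Boolean = {
    val isWindows =
      sys.props.getOrElse("os.name", "").toLowerCase.contains("windows")
    val winutils = sys.env.get("HADOOP_HOME")
      .map(home => new File(new File(home, "bin"), "winutils.exe"))
    isWindows && !winutils.exists(_.isFile)
  }
}
{code}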



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008480#comment-16008480
 ] 

Apache Spark commented on SPARK-20727:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/17966

> Skip SparkR tests when missing Hadoop winutils on CRAN windows machines
> ---
>
> Key: SPARK-20727
> URL: https://issues.apache.org/jira/browse/SPARK-20727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shivaram Venkataraman
>
> We should skip tests that use the Hadoop libraries while running
> the CRAN check with Windows as the operating system. This is to handle
> cases where the Hadoop winutils binaries are not available on the target
> system. The skipped tests will consist of:
> 1. Tests that save and load a model in MLlib
> 2. Tests that save and load CSV, JSON and Parquet files in SQL
> 3. Hive tests
> Note that these tests will still be run on AppVeyor for every PR, so our
> overall test coverage should not go down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20727:


Assignee: (was: Apache Spark)

> Skip SparkR tests when missing Hadoop winutils on CRAN windows machines
> ---
>
> Key: SPARK-20727
> URL: https://issues.apache.org/jira/browse/SPARK-20727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shivaram Venkataraman
>
> We should skip tests that use the Hadoop libraries while running
> the CRAN check with Windows as the operating system. This is to handle
> cases where the Hadoop winutils binaries are not available on the target
> system. The skipped tests will consist of:
> 1. Tests that save and load a model in MLlib
> 2. Tests that save and load CSV, JSON and Parquet files in SQL
> 3. Hive tests
> Note that these tests will still be run on AppVeyor for every PR, so our
> overall test coverage should not go down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008483#comment-16008483
 ] 

Apache Spark commented on SPARK-14659:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17967

> OneHotEncoder support drop first category alphabetically in the encoded 
> vector 
> ---
>
> Key: SPARK-14659
> URL: https://issues.apache.org/jira/browse/SPARK-14659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> R's formula interface drops the first category alphabetically when encoding a
> string/category feature. Spark's RFormula uses OneHotEncoder to encode a
> string/category feature into a vector, but it only supports "dropLast" by
> string/category frequencies. This causes SparkR to produce different models
> compared with native R.
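A minimal Scala sketch of the existing behavior being contrasted here (column
names are illustrative): StringIndexer orders labels by frequency, and
OneHotEncoder's `dropLast` drops the highest index, so the dropped category is
the least frequent one rather than the alphabetically first one as in R:

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OneHotDropLastDemo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("b", "a", "b", "c").toDF("category")

// StringIndexer assigns index 0.0 to the most frequent label ("b" here).
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// dropLast removes the highest index, i.e. a least-frequent label,
// not the alphabetically first category as R's formula interface does.
val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(true)
  .transform(indexed)

encoded.show()
{code}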



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20594) The staging directory should be appended with ".hive-staging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without starting with "."

2017-05-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20594.
-
   Resolution: Fixed
 Assignee: zuotingbing
Fix Version/s: 2.2.0

> The staging directory should be appended with ".hive-staging" to avoid being 
> deleted if we set hive.exec.stagingdir under the table directory without 
> starting with "."
> 
>
> Key: SPARK-20594
> URL: https://issues.apache.org/jira/browse/SPARK-20594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: zuotingbing
>Assignee: zuotingbing
> Fix For: 2.2.0
>
>
> The staging directory should be appended with ".hive-staging" to avoid being
> deleted when we set hive.exec.stagingdir to a path under the table directory
> that does not start with ".".
> spark-sql> set hive.exec.stagingdir=./test;
> spark-sql> insert overwrite table test_table1 select * from test_table;
> We get the following error:
> 2017-05-04 15:21:06,948 INFO org.apache.hadoop.hive.common.FileUtils: 
> deleting  
> hdfs://nameservice/spark/ztb.db/test_table1/test_hive_2017-05-04_15-21-05_972_7582740597864081934-1
> 2017-05-04 15:21:06,987 INFO org.apache.hadoop.fs.TrashPolicyDefault: Moved: 
> 'hdfs://nameservice/spark/ztb.db/test_table1/test_hive_2017-05-04_15-21-05_972_7582740597864081934-1'
>  to trash at: 
> hdfs://nameservice/user/mr/.Trash/Current/spark/ztb.db/test_table1/test_hive_2017-05-04_15-21-05_972_7582740597864081934-1
> 2017-05-04 15:21:06,987 INFO org.apache.hadoop.hive.common.FileUtils: Moved 
> to trash: 
> hdfs://nameservice/spark/ztb.db/test_table1/test_hive_2017-05-04_15-21-05_972_7582740597864081934-1
> 2017-05-04 15:21:07,001 ERROR org.apache.hadoop.hdfs.KeyProviderCache: Could 
> not find uri with key [dfs.encryption.key.provider.uri] to create a 
> keyProvider !!
> 2017-05-04 15:21:07,007 INFO hive.ql.metadata.Hive: Replacing 
> src:hdfs://nameservice/spark/ztb.db/test_table1/test_hive_2017-05-04_15-21-05_972_7582740597864081934-1/-ext-1/part-0,
>  dest: hdfs://nameservice/spark/ztb.db/test_table1/part-0, Status:false
> 2017-05-04 15:21:07,024 ERROR 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver: Failed in [insert 
> overwrite table test_table1 select * from test_table]
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:633)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:646)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:646)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:646)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:645)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:290)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:143)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.executeCollect(InsertIntoHiveTable.scala:308)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>

[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-12 Thread Kagan Turgut (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008526#comment-16008526
 ] 

Kagan Turgut commented on SPARK-13747:
--

I am hitting the same exception.

I am writing a new data source that reads batch files asynchronously into a
temp folder and then returns them as a DataFrame.

Within the buildScan(): RDD[Row] method I have a loop that saves the results
of each batch in a Parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF()
 df.write.mode(SaveMode.Overwrite).save(s"$tempDir${tempFile}")

Once the temp files are all written, buildScan loads all of those temp files
in parallel and returns their union as an RDD like this:

sqlContext.read
  .schema(schema)
  .load(files: _*)
  .queryExecution.executedPlan.execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue: I am writing the temp files at the same time
I am constructing the RDD to return.
Is there a better way of doing this?
As a workaround I can save the temp files as regular CSV, or upgrade to 2.12
to see if that fixes it, but I would prefer to save these files as Parquet
files using the Spark API.
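A minimal sketch of one possible workaround for the scenario above, assuming it
is acceptable to run the concurrent per-batch writes on a dedicated fixed-size
thread pool instead of Scala's global ForkJoinPool (the helper names below are
hypothetical, for illustration only):

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BatchWriter {
  // A dedicated fixed-size pool: a thread blocked inside SparkContext.runJob
  // simply stays blocked, instead of picking up another task the way a
  // ForkJoinPool worker can, so spark.sql.execution.id is not shared across
  // concurrent queries.
  implicit val writePool: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  // `batchPaths` and `writeOneBatch` stand in for the per-batch Parquet
  // writes described in the comment above.
  def writeAllBatches(batchPaths: Seq[String])(writeOneBatch: String => Unit): Unit = {
    val futures = batchPaths.map(path => Future(writeOneBatch(path)))
    Await.result(Future.sequence(futures), Duration.Inf)
  }
}
{code}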



> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task
> on the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-12 Thread Kagan Turgut (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008526#comment-16008526
 ] 

Kagan Turgut edited comment on SPARK-13747 at 5/12/17 6:27 PM:
---

I am hitting the same exception.

I am writing a new data source that reads batch files asynchronously into a
temp folder and then returns them as a DataFrame.

Within the buildScan(): RDD[Row] method I have a loop that saves the results
of each batch in a Parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF()
 df.write.mode(SaveMode.Overwrite).save(tempFile)

Once the temp files are all written, buildScan loads all of those temp files
in parallel and returns their union as an RDD like this:

sqlContext.read
  .schema(schema)
  .load(files: _*)
  .queryExecution.executedPlan.execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue: I am writing the temp files at the same time
I am constructing the RDD to return.
Is there a better way of doing this?
As a workaround I can save the temp files as regular CSV, or upgrade to 2.12
to see if that fixes it, but I would prefer to save these files as Parquet
files using the Spark API.




was (Author: kagan):
I am having the same exception.

I am creating a new data source that processes reading batch files 
asynchronously into a temp folder and then returns them as a data frame.

Within the buildScan(): RDD[Row]  method  I have a loop that saves the results 
of each batch in a parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF() 
 df.write.mode(SaveMode.Overwrite).save(s"$tempDir${tempFile}

Then once the temp files are all written, buildScan method returns 
I will load all those temp files in parallel and return the union in an RDD 
like this:
sqlContext.read
  .schema(schema)
  .load(files: _*) 
  .queryExecution.executedPlan. execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue as I am trying to write the temp files at same 
time I am trying to construct a return RDD.  
Is there a better way of doing this?
To work around, I can save the temp files as regular CSV to work around the 
issue, or upgrade to 2.12 to see if that fixes it, but I prefer to save these 
files as Parquet files using Spark API.



> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task
> on the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-12 Thread Kagan Turgut (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008526#comment-16008526
 ] 

Kagan Turgut edited comment on SPARK-13747 at 5/12/17 6:30 PM:
---

I am hitting the same exception.

I am writing a new data source that reads batch files asynchronously into a
temp folder and then returns them as a DataFrame.

Within the buildScan(): RDD[Row] method I have a loop that saves the results
of each batch in a Parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF()
 df.write.mode(SaveMode.Overwrite).save(tempFile)

Once the temp files are all written, buildScan loads all of those temp files
in parallel and returns their union as an RDD like this:

sqlContext.read
  .schema(schema)
  .load(files: _*)
  .queryExecution.executedPlan.execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue: I am writing the temp files at the same time
I am constructing the RDD to return.
Is there a better way of doing this?
As a workaround I can save the temp files as regular CSV, but I would prefer
to save these files as Parquet files using the Spark API.
Would upgrading to 2.12 fix this issue in my case?




was (Author: kagan):
I am having the same exception.

I am creating a new data source that processes reading batch files 
asynchronously into a temp folder and then returns them as a data frame.

Within the buildScan(): RDD[Row]  method  I have a loop that saves the results 
of each batch in a parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF() 
 df.write.mode(SaveMode.Overwrite).save(tempFile)}

Then once the temp files are all written, buildScan method returns 
I will load all those temp files in parallel and return the union in an RDD 
like this:
sqlContext.read
  .schema(schema)
  .load(files: _*) 
  .queryExecution.executedPlan. execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue as I am trying to write the temp files at same 
time I am trying to construct a return RDD.  
Is there a better way of doing this?
To work around, I can save the temp files as regular CSV to work around the 
issue, or upgrade to 2.12 to see if that fixes it, but I prefer to save these 
files as Parquet files using Spark API.



> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task
> on the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-12 Thread Kagan Turgut (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008526#comment-16008526
 ] 

Kagan Turgut edited comment on SPARK-13747 at 5/12/17 6:30 PM:
---

I am hitting the same exception.

I am writing a new data source that reads batch files asynchronously into a
temp folder and then returns them as a DataFrame.

Within the buildScan(): RDD[Row] method I have a loop that saves the results
of each batch in a Parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF()
 df.write.mode(SaveMode.Overwrite).save(tempFile)

Once the temp files are all written, buildScan loads all of those temp files
in parallel and returns their union as an RDD like this:

sqlContext.read
  .schema(schema)
  .load(files: _*)
  .queryExecution.executedPlan.execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue: I am writing the temp files at the same time
I am constructing the RDD to return.
Is there a better way of doing this?
As a workaround I can save the temp files as regular CSV, but I would prefer
to save these files as Parquet files using the Spark API.
Would upgrading to 2.12 fix this issue in my case?




was (Author: kagan):
I am having the same exception.

I am creating a new data source that processes reading batch files 
asynchronously into a temp folder and then returns them as a data frame.

Within the buildScan(): RDD[Row]  method  I have a loop that saves the results 
of each batch in a parquet file:

 val df = spark.sparkContext.parallelize(batchResult.records, 200).toDF() 
 df.write.mode(SaveMode.Overwrite).save(tempFile)}

Then once the temp files are all written, buildScan method returns 
I will load all those temp files in parallel and return the union in an RDD 
like this:
sqlContext.read
  .schema(schema)
  .load(files: _*) 
  .queryExecution.executedPlan. execute().asInstanceOf[RDD[Row]]

I can see the concurrency issue as I am trying to write the temp files at same 
time I am trying to construct a return RDD.  
Is there a better way of doing this?
To work around, I can save the temp files as regular CSV to work around the 
issuebut I prefer to save these files as Parquet files using Spark API.
Would upgrading to 2.12 fix this issue in my case?



> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task
> on the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2017-05-12 Thread Eugene Ilchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008582#comment-16008582
 ] 

Eugene Ilchenko commented on SPARK-9686:


2.1.0 still shows the same behavior. getSchemas() returns data from the
in-memory Derby metastore (the "default" db only), while "show databases" gets
its results from the properly configured local Hive metastore (i.e. MySQL).
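A self-contained Scala/JDBC sketch of the comparison described above (the URL
and credentials are placeholders), showing the two code paths side by side:

{code}
import java.sql.DriverManager

object ThriftMetadataCheck {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Placeholder URL/credentials; point this at the running Thrift server.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")

    // 1) What the JDBC metadata API reports (backed by the embedded Derby
    //    metastore in the behavior described above).
    val schemas = conn.getMetaData.getSchemas
    while (schemas.next()) println(s"getSchemas: ${schemas.getString(1)}")

    // 2) What a plain query against the configured Hive metastore returns.
    val rs = conn.createStatement().executeQuery("show databases")
    while (rs.next()) println(s"show databases: ${rs.getString(1)}")

    conn.close()
  }
}
{code}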

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Assignee: Cheng Lian
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Show tables; the newly created table is returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>       null, null, null);
> Problem:
>   No tables are returned by this API; this worked in Spark 1.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9792) PySpark DenseMatrix, SparseMatrix should override __eq__

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9792:
---

Assignee: (was: Apache Spark)

> PySpark DenseMatrix, SparseMatrix should override __eq__
> 
>
> Key: SPARK-9792
> URL: https://issues.apache.org/jira/browse/SPARK-9792
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> See [SPARK-9750].  Equality should be defined semantically, not in terms of 
> representation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9792) PySpark DenseMatrix, SparseMatrix should override __eq__

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008651#comment-16008651
 ] 

Apache Spark commented on SPARK-9792:
-

User 'gglanzani' has created a pull request for this issue:
https://github.com/apache/spark/pull/17968

> PySpark DenseMatrix, SparseMatrix should override __eq__
> 
>
> Key: SPARK-9792
> URL: https://issues.apache.org/jira/browse/SPARK-9792
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> See [SPARK-9750].  Equality should be defined semantically, not in terms of 
> representation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9792) PySpark DenseMatrix, SparseMatrix should override __eq__

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9792:
---

Assignee: Apache Spark

> PySpark DenseMatrix, SparseMatrix should override __eq__
> 
>
> Key: SPARK-9792
> URL: https://issues.apache.org/jira/browse/SPARK-9792
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Critical
>
> See [SPARK-9750].  Equality should be defined semantically, not in terms of 
> representation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4836) Web UI should display separate information for all stage attempts

2017-05-12 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008737#comment-16008737
 ] 

Alex Bozarth commented on SPARK-4836:
-

I looked into this a bit on behalf of [~ckadner] and I believe this is an issue
with how JobProgressListener stores completed stages. I'm not sure I will have
time to keep digging into it, so if someone wants to pick it up, please ping me
first to check whether I'm still working on it.

> Web UI should display separate information for all stage attempts
> -
>
> Key: SPARK-4836
> URL: https://issues.apache.org/jira/browse/SPARK-4836
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>
> I've run into some cases where the web UI job page will say that a job took 
> 12 minutes but the sum of that job's stage times is something like 10 
> seconds.  In this case, it turns out that my job ran a stage to completion 
> (which took, say, 5 minutes) then lost some partitions of that stage and had 
> to run a new stage attempt to recompute one or two tasks from that stage.  As 
> a result, the latest attempt for that stage reports only one or two tasks.  
> In the web UI, it seems that we only show the latest stage attempt, not all 
> attempts, which can lead to confusing / misleading displays for jobs with 
> failed / partially-recomputed stages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20719) Support LIMIT ALL

2017-05-12 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20719.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Support LIMIT ALL
> -
>
> Key: SPARK-20719
> URL: https://issues.apache.org/jira/browse/SPARK-20719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by 
> both PostgreSQL and Presto.
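A minimal illustration, e.g. in spark-shell on a build that contains this fix:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LimitAllDemo").master("local[*]").getOrCreate()
spark.range(5).createOrReplaceTempView("t")

// LIMIT ALL parses and behaves the same as having no LIMIT clause at all.
val withLimitAll = spark.sql("SELECT * FROM t LIMIT ALL")
val withoutLimit = spark.sql("SELECT * FROM t")
assert(withLimitAll.count() == withoutLimit.count())
{code}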



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20728) Make ORCFileFormat configurable between sql/hive and sql/core

2017-05-12 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-20728:
-

 Summary: Make ORCFileFormat configurable between sql/hive and 
sql/core
 Key: SPARK-20728
 URL: https://issues.apache.org/jira/browse/SPARK-20728
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Dongjoon Hyun


SPARK-20682 aims to give Apache Spark a new ORCFileFormat based on Apache ORC,
for several reasons.

This issue depends on SPARK-20682 and aims to provide a configuration to choose
the default ORCFileFormat from either the legacy `sql/hive` module or the new
`sql/core` module.

For example, this configuration will affect the following operations.
{code}
spark.read.orc(...)
{code}

{code}
CREATE TABLE t
USING ORC
...
{code}
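As an illustration only -- the configuration key below is hypothetical, since
its actual name is what this issue will decide -- switching implementations in
spark-shell might look like:

{code}
// Hypothetical configuration key for illustration; the real key is defined by this issue.
spark.conf.set("spark.sql.orc.impl", "native")   // new sql/core ORCFileFormat
val dfNew = spark.read.orc("/tmp/data.orc")

spark.conf.set("spark.sql.orc.impl", "hive")     // legacy sql/hive ORCFileFormat
val dfLegacy = spark.read.orc("/tmp/data.orc")
{code}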




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20728) Make ORCFileFormat configurable between sql/hive and sql/core

2017-05-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20728:
--
Issue Type: Improvement  (was: New Feature)

> Make ORCFileFormat configurable between sql/hive and sql/core
> -
>
> Key: SPARK-20728
> URL: https://issues.apache.org/jira/browse/SPARK-20728
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dongjoon Hyun
>
> SPARK-20682 aims to give Apache Spark a new ORCFileFormat based on Apache
> ORC, for several reasons.
> This issue depends on SPARK-20682 and aims to provide a configuration to
> choose the default ORCFileFormat from either the legacy `sql/hive` module or
> the new `sql/core` module.
> For example, this configuration will affect the following operations.
> {code}
> spark.read.orc(...)
> {code}
> {code}
> CREATE TABLE t
> USING ORC
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20682:
--
Issue Type: Improvement  (was: Bug)

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive`
> module with a Hive dependency. This issue aims to add a new and faster ORC
> data source inside `sql/core` and to replace the old ORC data source
> eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday)
> is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes, and we can rely more on the ORC
> community.
> - Usability: Users can use `ORC` data sources without the hive module, i.e.,
> without `-Phive`.
> - Maintainability: Reduces the Hive dependency; the old legacy code can be
> removed later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20728) Make ORCFileFormat configurable between sql/hive and sql/core

2017-05-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20728:
--
Issue Type: New Feature  (was: Bug)

> Make ORCFileFormat configurable between sql/hive and sql/core
> -
>
> Key: SPARK-20728
> URL: https://issues.apache.org/jira/browse/SPARK-20728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dongjoon Hyun
>
> SPARK-20682 aims to give Apache Spark a new ORCFileFormat based on Apache
> ORC, for several reasons.
> This issue depends on SPARK-20682 and aims to provide a configuration to
> choose the default ORCFileFormat from either the legacy `sql/hive` module or
> the new `sql/core` module.
> For example, this configuration will affect the following operations.
> {code}
> spark.read.orc(...)
> {code}
> {code}
> CREATE TABLE t
> USING ORC
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-05-12 Thread Greg Rahn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008856#comment-16008856
 ] 

Greg Rahn commented on SPARK-18004:
---

The right solution here is to make an explicit cast on the string
representation of the timestamp value and not rely on implicit casting by the
database.
An ANSI timestamp literal should work in nearly every RDBMS out there if the
string is in ISO 8601 format, e.g.:
{noformat}
select timestamp '2016-10-19 12:54:01.934';
timestamp
-
 2016-10-19 12:54:01.934

{noformat}
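A hedged Scala sketch of the same idea on the Spark side (the helper below is
illustrative, not Spark's actual JDBC dialect API): render the pushed-down
comparison with an explicit ANSI timestamp literal instead of a bare string, so
the database never has to guess the format:

{code}
import java.sql.Timestamp

object TimestampLiteral {
  // Renders a java.sql.Timestamp as an ANSI SQL timestamp literal, e.g.
  // TIMESTAMP '2016-10-19 12:54:01.934', which Oracle (and most other
  // databases) can evaluate without relying on implicit string conversion.
  def toAnsiLiteral(ts: Timestamp): String = s"TIMESTAMP '${ts.toString}'"

  def main(args: Array[String]): Unit = {
    val ts = Timestamp.valueOf("2016-10-19 12:54:01.934")
    // Instead of:  "TS" < '2016-10-19 12:54:01.934'   (implicit cast -> ORA-01861)
    // generate:    "TS" < TIMESTAMP '2016-10-19 12:54:01.934'
    println("\"TS\" < " + toAnsiLiteral(ts))
  }
}
{code}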

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis())));
> df.explain();
> df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---

[jira] [Commented] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-12 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008898#comment-16008898
 ] 

Weichen Xu commented on SPARK-20504:


I have taken the following steps to check this QA issue, and I also attach
some output logs. I skipped the `mllib` package, which is deprecated:


1) Use `jar -tf` to extract the classes in the `ml` package for both the
master version and the 2.1.1 version, using `grep` to filter out some nested
classes (whose class name contains "$").


2) Extract the classes that exist in both the master version and the 2.1.1
version, use `javap -protected -s` to get their signature information, and use
`diff` to compare them; I then manually check each difference against the
corresponding scala-doc and java-doc for consistency and potential
incompatibility problems.


3) Extract the classes added after version 2.1.1; these classes are:
---
org.apache.spark.ml.classification.LinearSVC
org.apache.spark.ml.classification.LinearSVC$
org.apache.spark.ml.classification.LinearSVCAggregator
org.apache.spark.ml.classification.LinearSVCCostFun
org.apache.spark.ml.classification.LinearSVCModel
org.apache.spark.ml.classification.LinearSVCModel$
org.apache.spark.ml.classification.LinearSVCParams
org.apache.spark.ml.clustering.ExpectationAggregator
org.apache.spark.ml.feature.Imputer
org.apache.spark.ml.feature.Imputer$
org.apache.spark.ml.feature.ImputerModel
org.apache.spark.ml.feature.ImputerModel$
org.apache.spark.ml.feature.ImputerParams
org.apache.spark.ml.fpm.AssociationRules
org.apache.spark.ml.fpm.AssociationRules$
org.apache.spark.ml.fpm.FPGrowth
org.apache.spark.ml.fpm.FPGrowth$
org.apache.spark.ml.fpm.FPGrowthModel
org.apache.spark.ml.fpm.FPGrowthModel$
org.apache.spark.ml.fpm.FPGrowthParams
org.apache.spark.ml.r.BisectingKMeansWrapper
org.apache.spark.ml.r.BisectingKMeansWrapper$
org.apache.spark.ml.recommendation.TopByKeyAggregator
org.apache.spark.ml.r.FPGrowthWrapper
org.apache.spark.ml.r.FPGrowthWrapper$
org.apache.spark.ml.r.LinearSVCWrapper
org.apache.spark.ml.r.LinearSVCWrapper$
org.apache.spark.ml.source.libsvm.LibSVMOptions
org.apache.spark.ml.source.libsvm.LibSVMOptions$
org.apache.spark.ml.stat.ChiSquareTest
org.apache.spark.ml.stat.ChiSquareTest$
org.apache.spark.ml.stat.Correlation
org.apache.spark.ml.stat.Correlation$
--
For these classes, I use `javap -s` to get their signatures and also manually
check their corresponding scala-docs and java-docs.


After checking the items listed above, I found no problems related to Java
compatibility.
The only small issue is that for classes marked `private` in Scala code, the
`private` modifier seems to be lost when compiled into bytecode, so `javap`
regards them as `public` classes and the Java docs also include them. These
classes include `***Aggregator`, `***CostFun` and so on, but I think this is a
problem the Scala compiler needs to resolve.


I attach the processing script I wrote and some intermediate output files for
further checking, including:
1) the processing script
2) the class and method signature diff between the 2.1.1 and master versions,
for the `ml` classes that exist in both versions
3) the class and method signatures of the `ml` classes added after version 2.1.1
4) the classes that exist in both the master and 2.1.1 versions
5) the classes added after version 2.1.1
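For reference, a compact Scala approximation of steps 1 and 2 above (the
attached shell script is the authoritative version; the jar paths here are
placeholders):

{code}
import java.io.PrintWriter
import scala.sys.process._

object MlApiDiff {
  // Jar paths are placeholders for the 2.1.1 and master spark-mllib builds.
  val oldJar = "spark-mllib_2.11-2.1.1.jar"
  val newJar = "spark-mllib_2.11-master.jar"

  // Step 1: list the org.apache.spark.ml classes in a jar, dropping nested
  // classes (names containing "$"), as described in the comment above.
  def mlClasses(jar: String): Set[String] =
    Seq("jar", "-tf", jar).!!.split("\n")
      .filter(n => n.startsWith("org/apache/spark/ml/") && n.endsWith(".class"))
      .map(_.stripSuffix(".class").replace('/', '.'))
      .filterNot(_.contains("$"))
      .toSet

  // Step 2: dump javap signatures for a set of classes against one jar.
  def signatures(jar: String, classes: Seq[String]): String =
    (Seq("javap", "-protected", "-s", "-classpath", jar) ++ classes).!!

  def main(args: Array[String]): Unit = {
    val common = (mlClasses(oldJar) intersect mlClasses(newJar)).toSeq.sorted
    new PrintWriter("old.sig") { write(signatures(oldJar, common)); close() }
    new PrintWriter("new.sig") { write(signatures(newJar, common)); close() }
    // diff exits 1 when the files differ, so use `!` (exit code) rather than `!!`.
    Seq("diff", "old.sig", "new.sig").!
  }
}
{code}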

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find is

[jira] [Updated] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-12 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-20504:
---
Attachment: 5_added_ml_class
4_common_ml_class
3_added_class_signature
2_signature.diff
1_process_script.sh

I attach the processing script I wrote and some intermediate output files for
further checking, including:
1) the processing script
2) the class and method signature diff between the 2.1.1 and master versions,
for the `ml` classes that exist in both versions
3) the class and method signatures of the `ml` classes added after version 2.1.1
4) the classes that exist in both the master and 2.1.1 versions
5) the classes added after version 2.1.1

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008905#comment-16008905
 ] 

Joseph K. Bradley commented on SPARK-20504:
---

Thanks!  The summary and results look good to me, so I'll close this.  We can 
reuse the script in the future too!

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.2.0
>
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-12 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008898#comment-16008898
 ] 

Weichen Xu edited comment on SPARK-20504 at 5/12/17 11:30 PM:
--

I have taken the following steps to check this QA issue, and I also attach
some output logs. I skipped the `mllib` package, which is deprecated:


1) Use `jar -tf` to extract the classes in the `ml` package for both the
master version and the 2.1.1 version, using `grep` to filter out some nested
classes (whose class name contains "$"; however, a class name that ends with
"$" may be an `object` and should be kept).


2) Extract the classes that exist in both the master version and the 2.1.1
version, use `javap -protected -s` to get their signature information, and use
`diff` to compare them; I then manually check each difference against the
corresponding scala-doc and java-doc for consistency and potential
incompatibility problems.


3) Extract the classes added after version 2.1.1; these classes are:
---
org.apache.spark.ml.classification.LinearSVC
org.apache.spark.ml.classification.LinearSVC$
org.apache.spark.ml.classification.LinearSVCAggregator
org.apache.spark.ml.classification.LinearSVCCostFun
org.apache.spark.ml.classification.LinearSVCModel
org.apache.spark.ml.classification.LinearSVCModel$
org.apache.spark.ml.classification.LinearSVCParams
org.apache.spark.ml.clustering.ExpectationAggregator
org.apache.spark.ml.feature.Imputer
org.apache.spark.ml.feature.Imputer$
org.apache.spark.ml.feature.ImputerModel
org.apache.spark.ml.feature.ImputerModel$
org.apache.spark.ml.feature.ImputerParams
org.apache.spark.ml.fpm.AssociationRules
org.apache.spark.ml.fpm.AssociationRules$
org.apache.spark.ml.fpm.FPGrowth
org.apache.spark.ml.fpm.FPGrowth$
org.apache.spark.ml.fpm.FPGrowthModel
org.apache.spark.ml.fpm.FPGrowthModel$
org.apache.spark.ml.fpm.FPGrowthParams
org.apache.spark.ml.r.BisectingKMeansWrapper
org.apache.spark.ml.r.BisectingKMeansWrapper$
org.apache.spark.ml.recommendation.TopByKeyAggregator
org.apache.spark.ml.r.FPGrowthWrapper
org.apache.spark.ml.r.FPGrowthWrapper$
org.apache.spark.ml.r.LinearSVCWrapper
org.apache.spark.ml.r.LinearSVCWrapper$
org.apache.spark.ml.source.libsvm.LibSVMOptions
org.apache.spark.ml.source.libsvm.LibSVMOptions$
org.apache.spark.ml.stat.ChiSquareTest
org.apache.spark.ml.stat.ChiSquareTest$
org.apache.spark.ml.stat.Correlation
org.apache.spark.ml.stat.Correlation$
--
For these classes, I use `javap -s` to get their signatures and also manually
check their corresponding scala-docs and java-docs.


After checking the items listed above, I found no problems related to Java
compatibility.
The only small issue is that for classes marked `private` in Scala code, the
`private` modifier seems to be lost when compiled into bytecode, so `javap`
regards them as `public` classes and the Java docs also include them. These
classes include `***Aggregator`, `***CostFun` and so on, but I think this is a
problem the Scala compiler needs to resolve.


I attach the processing script I wrote and some intermediate output files for
further checking, including:
1) the processing script
2) the class and method signature diff between the 2.1.1 and master versions,
for the `ml` classes that exist in both versions
3) the class and method signatures of the `ml` classes added after version 2.1.1
4) the classes that exist in both the master and 2.1.1 versions
5) the classes added after version 2.1.1


was (Author: weichenxu123):
I have already taken the following steps to check this QA issue, and I also 
attach some output logs in this email, I skipped the `mllib` package which are 
deprecated:


1)  use `jar -tf` to extract the classes in `ml` package, towards master 
version and 2.1.1 version, but I use `grep` to filter some nested classes 
(which class name contains “$”)


2) extracts the classes both existed in master version and 2.1.1 version, and 
use `javap -protected -s` to get the signature information of them, and use 
`diff` to compare their difference, and I manually check each difference, check 
their corresponding scala-doc and java-doc for consistency and potential 
incompatible problems.


3) extracts the classes added after 2.1.1 version, these classes are:
---
org.apache.spark.ml.classification.LinearSVC
org.apache.spark.ml.classification.LinearSVC$
org.apache.spark.ml.classification.LinearSVCAggregator
org.apache.spark.ml.classification.LinearSVCCostFun
org.apache.spark.ml.classification.LinearSVCModel
org.apache.spark.ml.classification.LinearSVCModel$
org.apache.spark.ml.classification.LinearSVCParams
org.apache.spark.ml.clustering.ExpectationAggregator
org.apache.spark.ml.feature.Imputer
org.apache.spark.ml.feature.Imputer$
org.apache.spark.ml.feature.ImputerModel
org.apache.spark.ml.feature.ImputerModel$
org.apache.spark.ml.feature.I

[jira] [Resolved] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20504.
---
   Resolution: Done
Fix Version/s: 2.2.0

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.2.0
>
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20729) Reduce boilerplate in Spark ML models

2017-05-12 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20729:
--

 Summary: Reduce boilerplate in Spark ML models
 Key: SPARK-20729
 URL: https://issues.apache.org/jira/browse/SPARK-20729
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz


Currently we implement both {{predict}} and {{write.ml}} for each ML wrapper, 
although the R code is virtually identical and all the model-specific logic is 
handled by the Scala wrappers.

Since we use S4 classes, we can extract this shared functionality into separate 
traits. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20729) Reduce boilerplate in Spark ML models

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008922#comment-16008922
 ] 

Apache Spark commented on SPARK-20729:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17969

> Reduce boilerplate in Spark ML models
> -
>
> Key: SPARK-20729
> URL: https://issues.apache.org/jira/browse/SPARK-20729
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Currently we implement both {{predict}} and {{write.ml}} for ML wrappers, 
> although R code is virtually identical and all the model specific logic is 
> handled by Scala wrappers.
> Since we use S4 classes we can extract these functionalities into separate 
> traits. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20729) Reduce boilerplate in Spark ML models

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20729:


Assignee: (was: Apache Spark)

> Reduce boilerplate in Spark ML models
> -
>
> Key: SPARK-20729
> URL: https://issues.apache.org/jira/browse/SPARK-20729
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Currently we implement both {{predict}} and {{write.ml}} for ML wrappers, 
> although R code is virtually identical and all the model specific logic is 
> handled by Scala wrappers.
> Since we use S4 classes we can extract these functionalities into separate 
> traits. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20729) Reduce boilerplate in Spark ML models

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20729:


Assignee: Apache Spark

> Reduce boilerplate in Spark ML models
> -
>
> Key: SPARK-20729
> URL: https://issues.apache.org/jira/browse/SPARK-20729
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>
> Currently we implement both {{predict}} and {{write.ml}} for ML wrappers, 
> although R code is virtually identical and all the model specific logic is 
> handled by Scala wrappers.
> Since we use S4 classes we can extract these functionalities into separate 
> traits. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8682) Range Join for Spark SQL

2017-05-12 Thread Gesly George (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009017#comment-16009017
 ] 

Gesly George commented on SPARK-8682:
-

Any chance that this will get addressed in an upcoming release? Range joins are 
critical for time-series data and something we need to do quite often. 

> Range Join for Spark SQL
> 
>
> Key: SPARK-8682
> URL: https://issues.apache.org/jira/browse/SPARK-8682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
> Attachments: perf_testing.scala
>
>
> Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered 
> Cartesian Join) when it has to execute the following range query:
> {noformat}
> SELECT A.*,
>B.*
> FROM   tableA A
>JOIN tableB B
> ON A.start <= B.end
>  AND A.end > B.start
> {noformat}
> This is horribly inefficient. The performance of this query can be greatly 
> improved, when one of the tables can be broadcasted, by creating a range 
> index. A range index is basically a sorted map containing the rows of the 
> smaller table, indexed by both the high and low keys. Using this structure, 
> the complexity of the query would go from O(N * M) to O(N * 2 * LOG(M)), N = 
> number of records in the larger table, M = number of records in the smaller 
> (indexed) table.
> I have created a pull request for this. According to the [Spark SQL: 
> Relational Data Processing in 
> Spark|http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf] 
> paper similar work (page 11, section 7.2) has already been done by the ADAM 
> project (cannot locate the code though). 
> Any comments and/or feedback are greatly appreciated.
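
To illustrate the range-index lookup described in the quoted description, here is 
a minimal, self-contained Scala sketch. The names ({{Interval}}, {{RangeIndex}}) 
and the single start-key index are simplifications for illustration only; they are 
not the structures proposed in the pull request.

{code}
// Toy model of the broadcast-side index: rows of the smaller table sorted by
// their start key, probed with a binary search for the join condition
//   A.start <= B.end AND A.end > B.start
case class Interval(start: Long, end: Long, payload: String)

class RangeIndex(rows: Seq[Interval]) {
  private val byStart = rows.sortBy(_.start).toArray

  /** Indexed rows b that overlap probe a, i.e. a.start <= b.end && a.end > b.start. */
  def probe(a: Interval): Seq[Interval] = {
    // Binary search for the first row whose start is >= a.end; rows at or beyond
    // that position can never satisfy a.end > b.start.
    var lo = 0
    var hi = byStart.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (byStart(mid).start < a.end) lo = mid + 1 else hi = mid
    }
    byStart.take(lo).filter(b => a.start <= b.end).toSeq
  }
}

object RangeIndexExample extends App {
  val small = Seq(Interval(0, 10, "b1"), Interval(5, 15, "b2"), Interval(20, 30, "b3"))
  val index = new RangeIndex(small)      // in Spark this structure would be broadcast
  val probeRow = Interval(8, 22, "a1")
  index.probe(probeRow).foreach(b => println(s"${probeRow.payload} joins ${b.payload}"))
}
{code}

Each probe then costs a binary search plus a scan over the surviving candidates 
instead of a pass over the whole table; indexing the high keys as well, as the 
description suggests, prunes that scan from the other side too.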



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20730) Add a new Optimizer rule to combine nested Concats

2017-05-12 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20730:


 Summary: Add a new Optimizer rule to combine nested Concats
 Key: SPARK-20730
 URL: https://issues.apache.org/jira/browse/SPARK-20730
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro


The master branch supports a pipeline operator '||' for concatenating strings. 
Since the parser generates nested Concat expressions, the optimizer needs to 
combine the nested expressions into a single Concat.
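
As a rough illustration of what such a rule does, here is a sketch over a toy 
expression tree; these are not Spark's actual Catalyst classes, nor the rule in 
the eventual pull request.

{code}
// Toy expression tree: the parser turns  a || b || c  into Concat(Concat(a, b), c);
// a combining rule flattens it into Concat(a, b, c).
sealed trait Expr
case class Literal(s: String) extends Expr
case class Concat(children: Seq[Expr]) extends Expr

object CombineConcatsSketch {
  // Recursively inline any Concat child into its parent's child list.
  def flatten(e: Expr): Expr = e match {
    case Concat(children) =>
      Concat(children.map(flatten).flatMap {
        case Concat(grandChildren) => grandChildren
        case other                 => Seq(other)
      })
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val parsed = Concat(Seq(Concat(Seq(Literal("a"), Literal("b"))), Literal("c")))
    println(flatten(parsed))  // Concat(List(Literal(a), Literal(b), Literal(c)))
  }
}
{code}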



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20730) Add a new Optimizer rule to combine nested Concats

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009051#comment-16009051
 ] 

Apache Spark commented on SPARK-20730:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17970

> Add a new Optimizer rule to combine nested Concats
> --
>
> Key: SPARK-20730
> URL: https://issues.apache.org/jira/browse/SPARK-20730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>
> The master supports a pipeline operator '||' to concatenate strings. Since 
> the parser generates nested Concat expressions, the optimizer needs to 
> combine the nested expressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20730) Add a new Optimizer rule to combine nested Concats

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20730:


Assignee: Apache Spark

> Add a new Optimizer rule to combine nested Concats
> --
>
> Key: SPARK-20730
> URL: https://issues.apache.org/jira/browse/SPARK-20730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> The master supports a pipeline operator '||' to concatenate strings. Since 
> the parser generates nested Concat expressions, the optimizer needs to 
> combine the nested expressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20730) Add a new Optimizer rule to combine nested Concats

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20730:


Assignee: (was: Apache Spark)

> Add a new Optimizer rule to combine nested Concats
> --
>
> Key: SPARK-20730
> URL: https://issues.apache.org/jira/browse/SPARK-20730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>
> The master supports a pipeline operator '||' to concatenate strings. Since 
> the parser generates nested Concat expressions, the optimizer needs to 
> combine the nested expressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-12 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009146#comment-16009146
 ] 

Wenchen Fan commented on SPARK-19122:
-

Sorry, I forgot to set the broadcast threshold; now I can reproduce this issue.

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if join predicates are specified in query in *same* order as bucketing 
> and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with join predicates in *different* order from bucketing and 
> sort order leads to extra shuffle and sort being introduced
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-12 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-19122:

Description: 
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if the join predicates are specified in the query in the *same* order as the 
bucketing and sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with the join predicates in a *different* order from the bucketing 
and sort order leads to an extra shuffle and sort being introduced.

{code}
scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k#101, j#100, 200)
  +- *Project [i#99, j#100, k#101]
 +- *Filter (isnotnull(j#100) && isnotnull(k#101))
+- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}

  was:
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if join predicates are specified in query in *same* order as bucketing and 
sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with join predicates in *different* order from bucketing and 
sort order leads to extra shuffle and sort being introduced

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), 

[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-05-12 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009165#comment-16009165
 ] 

Tejas Patil commented on SPARK-19122:
-

Thanks for confirming. I have added it to the JIRA description in case someone 
comes across this in the future.

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if join predicates are specified in query in *same* order as bucketing 
> and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with join predicates in *different* order from bucketing and 
> sort order leads to extra shuffle and sort being introduced
> {code}
> scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-12 Thread madhukara phatak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

madhukara phatak updated SPARK-20723:
-
Description: 
Currently the Random Forest implementation caches the intermediate data using the 
*MEMORY_AND_DISK* storage level. This creates issues in low-memory scenarios, so 
we should expose an expert param *intermediateStorageLevel* that allows the user 
to customise the storage level. This is similar to the ALS options specified in 
the JIRA below:

https://issues.apache.org/jira/browse/SPARK-14412
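
For reference, this is how the corresponding expert params are set on ALS today 
(the setters below were added by SPARK-14412); the RandomForestClassifier setter 
shown in the comment is hypothetical: it is the shape this ticket proposes, not an 
existing API.

{code}
// Snippet for spark-shell. The ALS setters exist today; the RandomForestClassifier
// setter is hypothetical and only illustrates the proposed param.
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setIntermediateStorageLevel("MEMORY_ONLY")   // expert param: intermediate RDD caching
  .setFinalStorageLevel("MEMORY_AND_DISK")      // expert param: final user/item factors

// Proposed (not yet existing) equivalent for random forests:
// val rf = new org.apache.spark.ml.classification.RandomForestClassifier()
//   .setIntermediateStorageLevel("MEMORY_AND_DISK_SER")
{code}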

  was:
Currently the Random Forest implementation caches the intermediate data using the 
*MEMORY_AND_DISK* storage level. This creates issues in low-memory scenarios, so 
we should expose an expert param *intermediateRDDStorageLevel* that allows the 
user to customise the storage level. This is similar to the ALS options specified 
in the JIRA below:

https://issues.apache.org/jira/browse/SPARK-14412


> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Priority: Minor
>
> Currently Random Forest implementation cache as the intermediatery data using 
> *MEMORY_AND_DISK* storage level. This creates issues in low memory scenarios. 
> So we should expose an expert param *intermediateStorageLevel* which allows 
> user to customise the storage level. This is similar to als options like 
> specified in below jira
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009171#comment-16009171
 ] 

Apache Spark commented on SPARK-20723:
--

User 'phatak-dev' has created a pull request for this issue:
https://github.com/apache/spark/pull/17972

> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Priority: Minor
>
> Currently Random Forest implementation cache as the intermediatery data using 
> *MEMORY_AND_DISK* storage level. This creates issues in low memory scenarios. 
> So we should expose an expert param *intermediateStorageLevel* which allows 
> user to customise the storage level. This is similar to als options like 
> specified in below jira
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20723:


Assignee: (was: Apache Spark)

> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Priority: Minor
>
> Currently Random Forest implementation cache as the intermediatery data using 
> *MEMORY_AND_DISK* storage level. This creates issues in low memory scenarios. 
> So we should expose an expert param *intermediateStorageLevel* which allows 
> user to customise the storage level. This is similar to als options like 
> specified in below jira
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20723) Random Forest Classifier should expose intermediateRDDStorageLevel similar to ALS

2017-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20723:


Assignee: Apache Spark

> Random Forest Classifier should expose intermediateRDDStorageLevel similar to 
> ALS
> -
>
> Key: SPARK-20723
> URL: https://issues.apache.org/jira/browse/SPARK-20723
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: madhukara phatak
>Assignee: Apache Spark
>Priority: Minor
>
> Currently Random Forest implementation cache as the intermediatery data using 
> *MEMORY_AND_DISK* storage level. This creates issues in low memory scenarios. 
> So we should expose an expert param *intermediateStorageLevel* which allows 
> user to customise the storage level. This is similar to als options like 
> specified in below jira
> https://issues.apache.org/jira/browse/SPARK-14412



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


