[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-05 Thread Xinyong Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502794#comment-16502794
 ] 

Xinyong Tian commented on SPARK-24431:
--

I read more about the first point of the PR curve:
https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/
In the above example, when the predicted probability for every row is set to 0.01, 
only one point on the PR curve is defined, i.e. recall=1, precision=0.01. According 
to the website, the first point of the PR curve should be reached by a horizontal line 
from the 2nd point (the only point, (1, 0.01), here), which would place it at (0, 0.01). 
In this way, the no-model's areaUnderPR = 0.01 instead of ~0.5.
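
A minimal sketch (not from the ticket) of the arithmetic above, assuming trapezoidal 
interpolation between PR points and an event rate of 0.01; it only illustrates how the 
choice of the artificial point at recall=0 drives the result:
{code:python}
def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (recall, precision) points."""
    area = 0.0
    for (r1, p1), (r2, p2) in zip(points, points[1:]):
        area += (r2 - r1) * (p1 + p2) / 2.0
    return area

event_rate = 0.01  # constant predictor: the only defined PR point is (1, event_rate)
print(trapezoid_area([(0.0, 1.0), (1.0, event_rate)]))         # ~0.505: precision forced to 1 at recall=0
print(trapezoid_area([(0.0, event_rate), (1.0, event_rate)]))  # 0.010: horizontal first segment (classeval convention)
print(trapezoid_area([(0.0, 0.0), (1.0, event_rate)]))         # 0.005: precision set to 0 at recall=0 (reporter's proposal)
{code}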

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
>  evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, no 
> variable will be selected (no model), i.e. every row's prediction is the same, e.g. 
> equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator reports the highest areaUnderPR. As a result, the 
> best model selected is a no-model. 
> The reason is the following: with no model, the precision-recall curve has only 
> two points: at recall=0, precision should be set to zero, while the software sets 
> it to 1; at recall=1, precision is the event rate. As a result, the areaUnderPR 
> will be close to 0.5 (my event rate is very low), which is the maximum.
> The solution is to set precision=0 when recall=0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-05 Thread Xinyong Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502761#comment-16502761
 ] 

Xinyong Tian commented on SPARK-24431:
--

Your understanding of event rate is what I meant.
I understand that the max areaUnderPR can be 1. What I meant is that 0.5 is the max 
areaUnderPR for the grid I searched. For example, let us say there is a dataset 
with event rate 0.01 and the best model's areaUnderPR is 0.30. Without any model, 
we can set the predicted probability for each row to 0.01. This is the situation 
when there is too much regularization. The problem is that, in this situation, 
BinaryClassificationEvaluator will calculate areaUnderPR as 0.50 (for the reason, 
see the original description), which is better than the best model. This is not 
what we want. 
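
A hedged sketch of the selection setup described in the report; the column names, 
grid values, and fold count are illustrative assumptions, not taken from the 
original job:
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.master("local[2]").appName("areaUnderPR-grid").getOrCreate()

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0, 10.0])  # large values shrink all coefficients toward 0
        .build())
evaluator = BinaryClassificationEvaluator(metricName="areaUnderPR")
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

# cvModel = cv.fit(train_df)
# With a rare positive class, the heavily regularized (constant-prediction) model
# can win on areaUnderPR because of the artificial recall=0, precision=1 point.
{code}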

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
>  evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, no 
> variable will be selected (no model), i.e. every row's prediction is the same, e.g. 
> equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator reports the highest areaUnderPR. As a result, the 
> best model selected is a no-model. 
> The reason is the following: with no model, the precision-recall curve has only 
> two points: at recall=0, precision should be set to zero, while the software sets 
> it to 1; at recall=1, precision is the event rate. As a result, the areaUnderPR 
> will be close to 0.5 (my event rate is very low), which is the maximum.
> The solution is to set precision=0 when recall=0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2018-06-05 Thread niuhuawei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502740#comment-16502740
 ] 

niuhuawei edited comment on SPARK-20712 at 6/6/18 1:59 AM:
---

I tried to fix the problem. 

First, I attempted to reproduce it, and I got it.

Just run ./bin/pyspark under the root of SPARK_HOME, then type the following:
{code:python}
spark.range(10).selectExpr(*(map(lambda x: "id as very_long_column_name_id" + str(x),
                                 range(200)))) \
    .selectExpr("struct(*) as nested").write.saveAsTable("test")
{code}
so, it appears again:

"
NestedThrowablesStackTrace: java.sql.SQLDataException: A truncation error was 
encountered trying to shrink VARCHAR 
'struct
"

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting this exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> 

[jira] [Commented] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2018-06-05 Thread niuhuawei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502740#comment-16502740
 ] 

niuhuawei commented on SPARK-20712:
---

I tried to fix the problem. 

First, I attempted to reproduce it, and I got it.

Just run ./bin/pyspark under the root of SPARK_HOME, then type the following:

"
>>> spark.range(10).selectExpr(*(map(lambda x:  "id as 
>>> very_long_column_name_id" + str(x), range(200.selectExpr("struct(*) as 
>>> nested").write.saveAsTable("test")
"

so, it appears again 

"
NestedThrowablesStackTrace: java.sql.SQLDataException: A truncation error was 
encountered trying to shrink VARCHAR 
'struct
"

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting this exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> 

[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-05 Thread Miao Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502733#comment-16502733
 ] 

Miao Wang commented on SPARK-15784:
---

[~WeichenXu123] Thank you very much! 

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24187) add array_join

2018-06-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24187.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21313
[https://github.com/apache/spark/pull/21313]

> add array_join
> --
>
> Key: SPARK-24187
> URL: https://issues.apache.org/jira/browse/SPARK-24187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> add R version of https://issues.apache.org/jira/browse/SPARK-23916



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24187) add array_join

2018-06-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24187:


Assignee: Huaxin Gao

> add array_join
> --
>
> Key: SPARK-24187
> URL: https://issues.apache.org/jira/browse/SPARK-24187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> add R version of https://issues.apache.org/jira/browse/SPARK-23916



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException

2018-06-05 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502609#comment-16502609
 ] 

Shixiong Zhu commented on SPARK-24472:
--

cc [~cloud_fan]

> Orc RecordReaderFactory throws IndexOutOfBoundsException
> 
>
> Key: SPARK-24472
> URL: https://issues.apache.org/jira/browse/SPARK-24472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When the column number of the underlying file schema is greater than the 
> column number of the table schema, Orc RecordReaderFactory will throw 
> IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be 
> turned off to use HiveTableScanExec. Here is a reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> Seq(("abc", 123, 123L)).toDF("s", "i", 
> "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest")
> spark.sql("""
> CREATE EXTERNAL TABLE orctest(s string)
> PARTITIONED BY (i int)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION '/tmp/orctest'
> """)
> spark.sql("msck repair table orctest")
> spark.sql("set spark.sql.hive.convertMetastoreOrc=false")
> // Exiting paste mode, now interpreting.
> 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.read.format("orc").load("/tmp/orctest").show()
> +---+---+---+
> |  s|  l|  i|
> +---+---+---+
> |abc|123|123|
> +---+---+---+
> scala> spark.sql("select * from orctest").show()
> 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.IndexOutOfBoundsException: toIndex = 2
>   at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
>   at java.util.ArrayList.subList(ArrayList.java:996)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> 

[jira] [Created] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException

2018-06-05 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-24472:


 Summary: Orc RecordReaderFactory throws IndexOutOfBoundsException
 Key: SPARK-24472
 URL: https://issues.apache.org/jira/browse/SPARK-24472
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Shixiong Zhu


When the column number of the underlying file schema is greater than the column 
number of the table schema, Orc RecordReaderFactory will throw 
IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be 
turned off to use HiveTableScanExec. Here is a reproducer:

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq(("abc", 123, 123L)).toDF("s", "i", 
"l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest")

spark.sql("""
CREATE EXTERNAL TABLE orctest(s string)
PARTITIONED BY (i int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/tmp/orctest'
""")

spark.sql("msck repair table orctest")

spark.sql("set spark.sql.hive.convertMetastoreOrc=false")


// Exiting paste mode, now interpreting.

18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.read.format("orc").load("/tmp/orctest").show()
+---+---+---+
|  s|  l|  i|
+---+---+---+
|abc|123|123|
+---+---+---+


scala> spark.sql("select * from orctest").show()
18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.IndexOutOfBoundsException: toIndex = 2
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)

[jira] [Commented] (SPARK-24471) MlLib distributed plans

2018-06-05 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-24471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502429#comment-16502429
 ] 

Tomasz Gawęda commented on SPARK-24471:
---

Questions should be posted on Mailing Lists, see 
http://spark.apache.org/community.html

> MlLib distributed plans
> ---
>
> Key: SPARK-24471
> URL: https://issues.apache.org/jira/browse/SPARK-24471
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Kyle Prifogle
>Priority: Major
>
>  
> I have found myself using MLlib's CoordinateMatrix and RowMatrix a lot lately. 
>  Since the new API is centered on ml.linalg and MLlib is in maintenance mode, 
> are there plans to move all the matrix components over to ml.linalg? I don't 
> see a distributed package in the new one yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-06-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Labels: Hydrogen SPIP  (was: SPIP)

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast

2018-06-05 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22384.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19602
[https://github.com/apache/spark/pull/19602]

> Refine partition pruning when attribute is wrapped in Cast
> --
>
> Key: SPARK-22384
> URL: https://issues.apache.org/jira/browse/SPARK-22384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> The SQL below fetches all partitions from the metastore, which puts a heavy 
> burden on the metastore:
> {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}}
> {{SELECT * from test where dt=2017}}
> The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} 
> and {{HiveShim}} fails to generate a proper partition filter.
> Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in 
> my warehouse.
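
A minimal illustration (not from the ticket) of the pruning-friendly form of the 
predicate, assuming an active SparkSession {{spark}} and the table defined above:
{code:python}
# Comparing the string partition column against a string literal keeps `dt` out of
# a Cast, so HiveShim can generate a metastore partition filter.
pruned = spark.sql("SELECT * FROM test WHERE dt = '2017'")

# The integer-literal form is the one the ticket reports: `dt` gets wrapped in a
# Cast and all partitions are fetched from the metastore (before this fix).
unpruned = spark.sql("SELECT * FROM test WHERE dt = 2017")
{code}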



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast

2018-06-05 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22384:
---

Assignee: jin xing

> Refine partition pruning when attribute is wrapped in Cast
> --
>
> Key: SPARK-22384
> URL: https://issues.apache.org/jira/browse/SPARK-22384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
>
> The SQL below fetches all partitions from the metastore, which puts a heavy 
> burden on the metastore:
> {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}}
> {{SELECT * from test where dt=2017}}
> The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} 
> and {{HiveShim}} fails to generate a proper partition filter.
> Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in 
> my warehouse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24471) MlLib distributed plans

2018-06-05 Thread Kyle Prifogle (JIRA)
Kyle Prifogle created SPARK-24471:
-

 Summary: MlLib distributed plans
 Key: SPARK-24471
 URL: https://issues.apache.org/jira/browse/SPARK-24471
 Project: Spark
  Issue Type: Question
  Components: MLlib
Affects Versions: 2.3.0
Reporter: Kyle Prifogle


 

I have found myself using MLlib's CoordinateMatrix and RowMatrix a lot lately. 
Since the new API is centered on ml.linalg and MLlib is in maintenance mode, are 
there plans to move all the matrix components over to ml.linalg? I don't see a 
distributed package in the new one yet.
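
A minimal sketch (not from the ticket) of the RDD-based distributed types being 
referred to, which today live under spark.mllib rather than spark.ml:
{code:python}
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.master("local[2]").appName("distributed-matrices").getOrCreate()

# CoordinateMatrix/RowMatrix are built from RDDs (pyspark.mllib.linalg.distributed);
# the DataFrame-based pyspark.ml.linalg package currently has no distributed module.
entries = spark.sparkContext.parallelize(
    [MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 3.0), MatrixEntry(2, 1, 5.0)])
coord = CoordinateMatrix(entries)
rows = coord.toRowMatrix()
print(rows.numRows(), rows.numCols())
spark.stop()
{code}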



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24470) RestSubmissionClient to be robust against 404 & non json responses

2018-06-05 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502177#comment-16502177
 ] 

Steve Loughran commented on SPARK-24470:


Stack trace from the issue:
{code}
Running Spark using the REST application submission protocol.
Exception in thread "main" 
org.apache.spark.deploy.rest.SubmitRestProtocolException: Malformed response 
received from server
at 
org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:269)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:429)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:441)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:216)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected 
character ('<' (code 60)): expected a valid value (number, String, array, 
object, 'true', 'false' or 'null')
at [Source: 


Error 404 Not Found

HTTP ERROR 404
Problem accessing /v1/submissions/create. Reason:
Not Found
Powered by Jetty://



; line: 1, column: 2]
at 
com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
at 
com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
at 
com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:462)
at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1624)
at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:689)
at 
com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3776)
at 
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721)
at 
com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
at 
org.apache.spark.deploy.rest.SubmitRestProtocolMessage$.parseAction(SubmitRestProtocolMessage.scala:112)
at 
org.apache.spark.deploy.rest.SubmitRestProtocolMessage$.fromJson(SubmitRestProtocolMessage.scala:130)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$1.apply(RestSubmissionClient.scala:248)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$1.apply(RestSubmissionClient.scala:235)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

> RestSubmissionClient to be robust against 404 & non json responses
> --
>
> Key: SPARK-24470
> URL: https://issues.apache.org/jira/browse/SPARK-24470
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: 

[jira] [Created] (SPARK-24470) RestSubmissionClient to be robust against 404 & non json responses

2018-06-05 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-24470:
--

 Summary: RestSubmissionClient to be robust against 404 & non json 
responses
 Key: SPARK-24470
 URL: https://issues.apache.org/jira/browse/SPARK-24470
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Steve Loughran


Judging by [Stack overflow 
50677915|https://stackoverflow.com/questions/50677915/unable-to-run-spark-submit-command],
 the RestSubmissionClient doesn't check the error code or content type before 
handing the response off to Jackson, so a 404 with an HTML body isn't handled well.
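
A hedged, language-agnostic sketch (in Python, not the Scala client itself) of the 
kind of check being proposed: verify status and content type before parsing, so an 
HTML 404 page yields a clear error instead of a JsonParseException:
{code:python}
import json
from urllib.request import urlopen
from urllib.error import HTTPError

def read_submission_response(url):
    # Fail fast on HTTP errors rather than trying to parse the error body as JSON.
    try:
        with urlopen(url) as resp:
            content_type = resp.headers.get("Content-Type", "")
            body = resp.read().decode("utf-8")
    except HTTPError as e:
        raise RuntimeError(f"Server returned HTTP {e.code} for {url}") from e
    # Only hand the body to the JSON parser if the server actually sent JSON.
    if "application/json" not in content_type:
        raise RuntimeError(
            f"Expected a JSON response from {url}, got {content_type}: {body[:200]}")
    return json.loads(body)
{code}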



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24469) Support collations in Spark SQL

2018-06-05 Thread Alexander Shkapsky (JIRA)
Alexander Shkapsky created SPARK-24469:
--

 Summary: Support collations in Spark SQL
 Key: SPARK-24469
 URL: https://issues.apache.org/jira/browse/SPARK-24469
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Alexander Shkapsky


One of our use cases is to support case-insensitive comparison in operations, 
including aggregation and text-comparison filters. Another use case is to sort 
via a collator. Support for collations throughout the query processor appears to 
be the proper way to support these needs.

Language-based workarounds (for the aggregation case) are insufficient (a sketch 
of workaround 2 follows this list):
 # SELECT UPPER(text)...GROUP BY UPPER(text)
introduces invalid values into the output set
 # SELECT MIN(text)...GROUP BY UPPER(text) 
results in poor performance in our case, in part due to the use of a sort-based 
aggregate
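
A minimal sketch (not from the ticket) of workaround 2 in PySpark, assuming an 
active SparkSession {{spark}} and a string column {{text}}:
{code:python}
# Group case-insensitively while keeping one representative original value.
df = spark.createDataFrame([("Foo",), ("foo",), ("Bar",)], ["text"])
df.createOrReplaceTempView("t")
spark.sql("""
    SELECT MIN(text) AS text, COUNT(*) AS cnt
    FROM t
    GROUP BY UPPER(text)
""").show()
# MIN over a string key is what the reporter observes falling back to a sort-based
# aggregate in their workload, hence the poor performance noted in the list above.
{code}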

Examples of collation support in RDBMS:
 * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
 * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
 * 
[Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
 * [SQL 
Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
 * 
[DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24468:


Assignee: Apache Spark

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Assignee: Apache Spark
>Priority: Minor
>
> Hi, I am using MySQL JDBC Driver along with Spark to do some sql queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> , decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision    Result Scale
>  *   
>  *   e1 + e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 - e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 * e2  p1 + p2 + 1 s1 + s2
>  *   e1 / e2  p1 - s1 + s2 + max(6, s1 + p2 + 1)  max(6, s1 + p2 + 1)
>  *   e1 % e2  min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) and thus fails the assertion 
> (scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  
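
A minimal sketch (not from the ticket) of the rule arithmetic quoted in the 
description above, for the e1 * e2 case that triggers the failure:
{code:python}
# Result of e1 * e2 per the quoted rules: precision = p1 + p2 + 1, scale = s1 + s2.
p1, s1 = 3, -8    # the literal 2.34E10 parsed as decimal(3, -8)
p2, s2 = 20, 0    # the LongType operand widened to decimal(20, 0)

result_precision = p1 + p2 + 1   # 24
result_scale = s1 + s2           # -8, which violates the scale >= 0 assertion
print(f"decimal({result_precision}, {result_scale})")
{code}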



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24468:


Assignee: (was: Apache Spark)

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Priority: Minor
>
> Hi, I am using MySQL JDBC Driver along with Spark to do some sql queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> , decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision    Result Scale
>  *   
>  *   e1 + e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 - e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 * e2  p1 + p2 + 1 s1 + s2
>  *   e1 / e2  p1 - s1 + s2 + max(6, s1 + p2 + 1)  max(6, s1 + p2 + 1)
>  *   e1 % e2  min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) and thus fails the assertion 
> (scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502038#comment-16502038
 ] 

Apache Spark commented on SPARK-24468:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21499

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Priority: Minor
>
> Hi, I am using MySQL JDBC Driver along with Spark to do some sql queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> , decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision    Result Scale
>  *   
>  *   e1 + e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 - e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 * e2  p1 + p2 + 1 s1 + s2
>  *   e1 / e2  p1 - s1 + s2 + max(6, s1 + p2 + 1)  max(6, s1 + p2 + 1)
>  *   e1 % e2  min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) and thus fails the assertion 
> (scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501655#comment-16501655
 ] 

Marco Gaido commented on SPARK-24468:
-

Thanks for reporting this. I will submit soon a fix. Thanks.

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Priority: Minor
>
> Hi, I am using MySQL JDBC Driver along with Spark to do some sql queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> , decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision    Result Scale
>  *   
>  *   e1 + e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 - e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
>  *   e1 * e2  p1 + p2 + 1 s1 + s2
>  *   e1 / e2  p1 - s1 + s2 + max(6, s1 + p2 + 1)  max(6, s1 + p2 + 1)
>  *   e1 % e2  min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) and thus fails the assertion 
> (scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s

2018-06-05 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24429:

Issue Type: Improvement  (was: Bug)

> Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s
> --
>
> Key: SPARK-24429
> URL: https://issues.apache.org/jira/browse/SPARK-24429
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
>  
> Right now in cluster mode only extraJavaOptions targeting the executor are 
> set. 
> According to the implementation and the docs:
> "In client mode, this config must not be set through the {{SparkConf}} 
> directly in your application, because the driver JVM has already started at 
> that point. Instead, please set this through the {{--driver-java-options}} 
> command line option or in your default properties file."
> A typical driver launch in cluster mode will eventually use client mode to 
> run spark-submit and looks like:
> "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit 
> --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal 1"
> Also, the entrypoint.sh file does not manage the driver's Java 
> opts. 
> We propose to set an env var to pass the extra java opts to the driver (like 
> in the case of the executor), rename the env vars in the container as the one 
> for the executor is a bit misleading, and use --driver-java-options to pass 
> the required options.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24410:


Assignee: (was: Apache Spark)

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (as the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-06-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501560#comment-16501560
 ] 

Apache Spark commented on SPARK-24410:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21498

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (as the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24410:


Assignee: Apache Spark

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Assignee: Apache Spark
>Priority: Major
>
> A common use-case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (unlike the join operator).
> For example, for two bucketed tables a1 and a2:
> {code}
> sparkSession.range(N).selectExpr(
>     "id as key",
>     "id % 2 as t1",
>     "id % 3 as t2")
>   .repartition(col("key"))
>   .write
>   .mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a1")
>
> sparkSession.range(N).selectExpr(
>     "id as key",
>     "id % 2 as t1",
>     "id % 3 as t2")
>   .repartition(col("key"))
>   .write
>   .mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> For the join query we get the expected SortMergeJoin:
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> But for an aggregation after the union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Yifei Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifei Wu updated SPARK-24468:
-
Description: 
Hi, I am using the MySQL JDBC driver along with Spark to run some SQL queries.

When multiplying a LongType with a decimal in scientific notation, say
{code:java}
spark.sql("select some_int * 2.34E10 from t"){code}
the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be cast 
as decimal(20,0).

 

So according to the rules in comments:
{code:java}
/*
 *   Operation    Result Precision                        Result Scale
 *   ------------------------------------------------------------------------
 *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
 *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
 *   e1 * e2      p1 + p2 + 1                             s1 + s2
 *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
 *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
 *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
*/
{code}
their multiplication will be decimal(3+20+1, -8+0) = decimal(24, -8), whose 
negative scale fails the assertion assert(scale >= 0) at DecimalType.scala:166.

 

My current workaround is to set spark.sql.decimalOperations.allowPrecisionLoss 
to false.

 

  was:
Hi, I am using MySQL JDBC Driver along with Spark to do some sql queries.

When multiplying a LongType with a decimal in scientific notation, say
{code:java}
spark.sql("select some_int * 2.34E10 from t").show{code}
, decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be casted 
as decimal(20,0).

 

So according to the rules in comments:
{code:java}
/*
 *   OperationResult PrecisionResult Scale
 *   
 *   e1 + e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
 *   e1 - e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
 *   e1 * e2  p1 + p2 + 1 s1 + s2
 *   e1 / e2  p1 - s1 + s2 + max(6, s1 + p2 + 1)  max(6, s1 + p2 + 1)
 *   e1 % e2  min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)
 *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)
*/
{code}
their multiplication will be decimal(3+20+1,-8+0) and thus fails the assert 
assumption (scale>=0) on DecimalType.scala:166.

 

My current workaround is to set spark.sql.decimalOperations.allowPrecisionLoss 
to false.

 


> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Priority: Minor
>
> Hi, I am using the MySQL JDBC driver along with Spark to run some SQL queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t"){code}
> the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision                        Result Scale
>  *   ------------------------------------------------------------------------
>  *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 * e2      p1 + p2 + 1                             s1 + s2
>  *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
>  *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) = decimal(24, -8), whose 
> negative scale fails the assertion assert(scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  
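
A minimal sketch, not from the report itself, of how the multiplication rule 
above plays out for this query and of the configuration workaround; it assumes 
a SparkSession named spark and the reporter's table t with an integer column 
some_int:
{code:java}
// e1 = some_int cast to decimal(20,0)  => p1 = 20, s1 = 0
// e2 = 2.34E10 parsed as decimal(3,-8) => p2 = 3,  s2 = -8
// e1 * e2 rule: result precision = p1 + p2 + 1 = 24, result scale = s1 + s2 = -8
// The negative result scale is what trips assert(scale >= 0) in adjustPrecisionScale.

// Workaround described above: disable the precision-loss adjustment.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("select some_int * 2.34E10 from t").show()
{code}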



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Yifei Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifei Wu updated SPARK-24468:
-
Description: 
Hi, I am using the MySQL JDBC driver along with Spark to run some SQL queries.

When multiplying a LongType with a decimal in scientific notation, say
{code:java}
spark.sql("select some_int * 2.34E10 from t").show{code}
the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be cast 
as decimal(20,0).

 

So according to the rules in comments:
{code:java}
/*
 *   Operation    Result Precision                        Result Scale
 *   ------------------------------------------------------------------------
 *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
 *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
 *   e1 * e2      p1 + p2 + 1                             s1 + s2
 *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
 *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
 *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
*/
{code}
their multiplication will be decimal(3+20+1, -8+0) = decimal(24, -8), whose 
negative scale fails the assertion assert(scale >= 0) at DecimalType.scala:166.

 

My current workaround is to set spark.sql.decimalOperations.allowPrecisionLoss 
to false.

 

  was:
Hi, I am using MySQL JDBC Driver along with Spark to do some sql queries.

When multiplying a LongType with a decimal in scientific notation, say
{code}
spark.sql("select some_int * 2.34E10 from t").show{code}
, decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be casted 
as decimal(20,0).

So according to the rules in comments:
{code:java}
/*
 *   OperationResult PrecisionResult Scale
 *   
 *   e1 + e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
 *   e1 - e2  max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
 *   e1 * e2  p1 + p2 + 1 s1 + s2
 *   e1 / e2  p1 - s1 + s2 + max(6, s1 + p2 + 1)  max(6, s1 + p2 + 1)
 *   e1 % e2  min(p1-s1, p2-s2) + max(s1, s2) max(s1, s2)
 *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)
*/
{code}
their multiplication will be decimal(3+20+1,-8+0) and thus fails the assert 
assumption (scale>=0) on DecimalType.scala:166.

 


> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Priority: Minor
>
> Hi, I am using the MySQL JDBC driver along with Spark to run some SQL queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code:java}
> spark.sql("select some_int * 2.34E10 from t").show{code}
> the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
>  
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision                        Result Scale
>  *   ------------------------------------------------------------------------
>  *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 * e2      p1 + p2 + 1                             s1 + s2
>  *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
>  *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) = decimal(24, -8), whose 
> negative scale fails the assertion assert(scale >= 0) at DecimalType.scala:166.
>  
> My current workaround is to set 
> spark.sql.decimalOperations.allowPrecisionLoss to false.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when scale is negative

2018-06-05 Thread Yifei Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifei Wu updated SPARK-24468:
-
Summary: DecimalType `adjustPrecisionScale` might fail when scale is 
negative  (was: DecimalType `adjustPrecisionScale` might fail when 
assert(scale>=0))

> DecimalType `adjustPrecisionScale` might fail when scale is negative
> 
>
> Key: SPARK-24468
> URL: https://issues.apache.org/jira/browse/SPARK-24468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifei Wu
>Priority: Minor
>
> Hi, I am using the MySQL JDBC driver along with Spark to run some SQL queries.
> When multiplying a LongType with a decimal in scientific notation, say
> {code}
> spark.sql("select some_int * 2.34E10 from t").show{code}
> the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be 
> cast as decimal(20,0).
> So according to the rules in comments:
> {code:java}
> /*
>  *   Operation    Result Precision                        Result Scale
>  *   ------------------------------------------------------------------------
>  *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
>  *   e1 * e2      p1 + p2 + 1                             s1 + s2
>  *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
>  *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
>  *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
> */
> {code}
> their multiplication will be decimal(3+20+1, -8+0) = decimal(24, -8), whose 
> negative scale fails the assertion assert(scale >= 0) at DecimalType.scala:166.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24468) DecimalType `adjustPrecisionScale` might fail when assert(scale>=0)

2018-06-05 Thread Yifei Wu (JIRA)
Yifei Wu created SPARK-24468:


 Summary: DecimalType `adjustPrecisionScale` might fail when 
assert(scale>=0)
 Key: SPARK-24468
 URL: https://issues.apache.org/jira/browse/SPARK-24468
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Yifei Wu


Hi, I am using the MySQL JDBC driver along with Spark to run some SQL queries.

When multiplying a LongType with a decimal in scientific notation, say
{code}
spark.sql("select some_int * 2.34E10 from t").show{code}
the decimal 2.34E10 will be treated as decimal(3,-8), and some_int will be cast 
as decimal(20,0).

So according to the rules in comments:
{code:java}
/*
 *   Operation    Result Precision                        Result Scale
 *   ------------------------------------------------------------------------
 *   e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
 *   e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1     max(s1, s2)
 *   e1 * e2      p1 + p2 + 1                             s1 + s2
 *   e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
 *   e1 % e2      min(p1-s1, p2-s2) + max(s1, s2)         max(s1, s2)
 *   e1 union e2  max(s1, s2) + max(p1-s1, p2-s2)         max(s1, s2)
*/
{code}
their multiplication will be decimal(3+20+1, -8+0) = decimal(24, -8), whose 
negative scale fails the assertion assert(scale >= 0) at DecimalType.scala:166.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24466:


Assignee: (was: Apache Spark)

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits 
> before reading the actual data, so the query also exits with an error.
>  
> The reason is that a temporary reader is launched just to read the schema, 
> then closed, and the reader is re-opened afterwards. While a reliable socket 
> server should handle this without any issue, nc normally can't handle 
> multiple connections and simply exits when the temporary reader closes its 
> connection.
>  
> Given that the socket source is mainly used in examples from the official 
> documentation or for quick experiments, where we tend to simply use netcat, 
> this is better treated as a bug, even though it is also a limitation of 
> netcat.
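
A minimal sketch, not taken from the report, of the setup that hits this; it 
assumes a SparkSession named spark and a plain `nc -l 9999` (without -k) 
started in another terminal:
{code}
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Per the report, a temporary reader is opened just to fetch the schema and then
// closed; plain nc exits on that first connection close, so the reader that the
// started query re-opens has nothing to connect to and the query fails.
val query = lines.writeStream
  .format("console")
  .start()

query.awaitTermination()
{code}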



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24466:


Assignee: Apache Spark

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits 
> before reading the actual data, so the query also exits with an error.
>  
> The reason is that a temporary reader is launched just to read the schema, 
> then closed, and the reader is re-opened afterwards. While a reliable socket 
> server should handle this without any issue, nc normally can't handle 
> multiple connections and simply exits when the temporary reader closes its 
> connection.
>  
> Given that the socket source is mainly used in examples from the official 
> documentation or for quick experiments, where we tend to simply use netcat, 
> this is better treated as a bug, even though it is also a limitation of 
> netcat.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501498#comment-16501498
 ] 

Apache Spark commented on SPARK-24466:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/21497

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits 
> before reading the actual data, so the query also exits with an error.
>  
> The reason is that a temporary reader is launched just to read the schema, 
> then closed, and the reader is re-opened afterwards. While a reliable socket 
> server should handle this without any issue, nc normally can't handle 
> multiple connections and simply exits when the temporary reader closes its 
> connection.
>  
> Given that the socket source is mainly used in examples from the official 
> documentation or for quick experiments, where we tend to simply use netcat, 
> this is better treated as a bug, even though it is also a limitation of 
> netcat.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24453) Fix error recovering from the failure in a no-data batch

2018-06-05 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24453.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21491
[https://github.com/apache/spark/pull/21491]

> Fix error recovering from the failure in a no-data batch
> 
>
> Key: SPARK-24453
> URL: https://issues.apache.org/jira/browse/SPARK-24453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> java.lang.AssertionError: assertion failed: Concurrent update to the log. 
> Multiple streaming jobs detected for 159897
> {code}
> The error occurs when we are recovering from a failure in a no-data batch 
> (say X) that has been planned (i.e. written to the offset log) but not 
> executed (i.e. not written to the commit log). Upon recovery, the following 
> sequence of events happens.
> - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. 
> Since there was no data in the batch, `availableOffsets` is the same as 
> `committedOffsets`, so `isNewDataAvailable` is false.
> - When `MicroBatchExecution.constructNextBatch` is called, it should ideally 
> return true immediately because the next batch has already been constructed. 
> However, the check for whether the batch had been constructed was `if 
> (isNewDataAvailable) return true`. Since the planned batch is a no-data 
> batch, it escapes this check and proceeds to plan the same batch X once 
> again. If there is new data since the failure, it plans a new batch, tries to 
> write its offsets to the `offsetLog` as batch id X, and fails with the above 
> error.
> The correct solution is to check the offset log to see whether 
> `currentBatchId` is the latest planned batch or not.
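
A schematic sketch of the check described above; the values below are 
stand-ins for the offset log contents and the MicroBatchExecution state, not 
real Spark APIs:
{code}
// Stand-in state: batch X was planned (written to the offset log) before the failure.
val currentBatchId: Long = 159897L                      // the recovered no-data batch X
val plannedBatchIds: Seq[Long] = Seq(159896L, 159897L)  // batch ids present in the offset log

// The batch counts as already constructed when the offset log already has an entry
// for it, regardless of whether new data has arrived since the failure.
val alreadyConstructed = plannedBatchIds.lastOption.exists(_ >= currentBatchId)

if (alreadyConstructed) {
  println(s"re-running already-planned batch $currentBatchId") // do not plan it again
} else {
  println(s"constructing new batch $currentBatchId")           // write its offsets to the log
}
{code}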



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-05 Thread Ismael Juma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501351#comment-16501351
 ] 

Ismael Juma commented on SPARK-18057:
-

Yes, it is [~kabhwan]. The major version bump is simply because support for 
Java 7 has been dropped.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-05 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24418:

Description: 
Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
version bump. 

*loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
initialize Spark before the REPL sees any files.

Issue filed with the Scala community:
https://github.com/scala/bug/issues/10913

  was:
Scala 2.11.12+ will support JDK9+. However, this is not goin to be a simple 
version bump. 

*loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
initialize the Spark before REPL sees any files.

Issue filed in Scala community.
https://github.com/scala/bug/issues/10913


> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed with the Scala community:
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24418:


Assignee: DB Tsai  (was: Apache Spark)

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed with the Scala community:
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24418:


Assignee: Apache Spark  (was: DB Tsai)

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed with the Scala community:
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501337#comment-16501337
 ] 

Apache Spark commented on SPARK-24418:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21495

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed with the Scala community:
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-05 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501334#comment-16501334
 ] 

Jungtaek Lim commented on SPARK-18057:
--

Is the Kafka 2.0.0 client compatible with Kafka 1.x and 0.10.x brokers? I guess 
end users might hesitate to use the latest version in production, especially 
when the major version has changed. The supported broker version range is the 
most important thing to consider while upgrading.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24467) VectorAssemblerEstimator

2018-06-05 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24467:
-

 Summary: VectorAssemblerEstimator
 Key: SPARK-24467
 URL: https://issues.apache.org/jira/browse/SPARK-24467
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
`VectorSizeHint` instead of making `VectorAssembler` into an Estimator since I 
thought the latter option would break most workflows.  However, I should have 
proposed:
* Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
inputCols.  This Param can be optional.  If not given, then VectorAssembler 
will behave as it does now.  If given, then VectorAssembler can use that info 
instead of figuring out the Vector sizes via metadata or examining Rows in the 
data (though it could do consistency checks).
* Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
produces a VectorAssembler with the vector lengths Param specified.

This will not break existing workflows.  Migrating to VectorAssemblerEstimator 
will be easier than adding VectorSizeHint since it will not require users to 
manually input Vector lengths.

Note: Even with this Estimator, VectorSizeHint might prove useful for other 
things in the future which require vector length metadata, so we could consider 
keeping it rather than deprecating it.
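
A rough sketch of how the proposal could look from a user's point of view; the 
Param and the Estimator below do not exist in Spark, and every name marked as 
hypothetical is illustrative only:
{code}
import org.apache.spark.ml.feature.VectorAssembler

// Option A: an optional Param on VectorAssembler holding the input vector sizes,
// so no metadata lookup or row inspection is needed at transform time.
val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "itemFeatures"))
  .setOutputCol("features")
  .setInputSizes(Array(3, 5))                  // hypothetical new Param

// Option B: an Estimator that reads the vector lengths from the data once and
// returns a VectorAssembler with that Param already filled in.
val estimator = new VectorAssemblerEstimator() // hypothetical new class
  .setInputCols(Array("userFeatures", "itemFeatures"))
  .setOutputCol("features")
val fitted: VectorAssembler = estimator.fit(trainingDF) // trainingDF: illustrative DataFrame
{code}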



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org