[jira] [Commented] (SPARK-29239) Subquery should not cause NPE when eliminating subexpression

2019-09-25 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937495#comment-16937495
 ] 

Liang-Chi Hsieh commented on SPARK-29239:
-

I added SPARK-29221 to the title of the PR.

> Subquery should not cause NPE when eliminating subexpression
> 
>
> Key: SPARK-29239
> URL: https://issues.apache.org/jira/browse/SPARK-29239
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Subexpression elimination can cause an NPE when applied to an execution 
> subquery expression such as ScalarSubquery. This is because PlanExpression 
> wraps a query plan, and comparing query plans on the executor while 
> eliminating subexpressions can cause unexpected errors, such as an NPE when 
> accessing transient fields.






[jira] [Commented] (SPARK-29239) Subquery should not cause NPE when eliminating subexpression

2019-09-25 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937486#comment-16937486
 ] 

Liang-Chi Hsieh commented on SPARK-29239:
-

Yes.

> Subquery should not cause NPE when eliminating subexpression
> 
>
> Key: SPARK-29239
> URL: https://issues.apache.org/jira/browse/SPARK-29239
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Subexpression elimination can cause an NPE when applied to an execution 
> subquery expression such as ScalarSubquery. This is because PlanExpression 
> wraps a query plan, and comparing query plans on the executor while 
> eliminating subexpressions can cause unexpected errors, such as an NPE when 
> accessing transient fields.






[jira] [Created] (SPARK-29239) Subquery should not cause NPE when eliminating subexpression

2019-09-25 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29239:
---

 Summary: Subquery should not cause NPE when eliminating 
subexpression
 Key: SPARK-29239
 URL: https://issues.apache.org/jira/browse/SPARK-29239
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


Subexpression elimination can cause an NPE when applied to an execution 
subquery expression such as ScalarSubquery. This is because PlanExpression 
wraps a query plan, and comparing query plans on the executor while eliminating 
subexpressions can cause unexpected errors, such as an NPE when accessing 
transient fields.
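As a rough illustration of the guard described above (a minimal sketch with made-up types, not Spark's EquivalentExpressions or PlanExpression classes): when collecting common subexpressions, any tree that embeds a subquery-like node is simply skipped, so its wrapped plan is never compared on the executor.

{code:scala}
// Self-contained sketch; Expr, Add, Literal and Subquery are simplified
// stand-ins, not Spark's expression classes.
sealed trait Expr { def children: Seq[Expr] }
case class Add(left: Expr, right: Expr) extends Expr { def children = Seq(left, right) }
case class Literal(value: Int) extends Expr { def children = Nil }
// Stand-in for a PlanExpression/ScalarSubquery: wraps an opaque plan reference.
case class Subquery(planId: Long) extends Expr { def children = Nil }

object SubexprElimination {
  private def containsSubquery(e: Expr): Boolean =
    e.isInstanceOf[Subquery] || e.children.exists(containsSubquery)

  /** Count subtree occurrences, skipping any tree that embeds a subquery. */
  def commonSubexpressions(root: Expr): Map[Expr, Int] = {
    val counts = scala.collection.mutable.Map.empty[Expr, Int].withDefaultValue(0)
    def visit(e: Expr): Unit = {
      if (!containsSubquery(e)) counts(e) += 1 // guard: never compare subquery plans
      e.children.foreach(visit)
    }
    visit(root)
    counts.filter(_._2 > 1).toMap
  }
}

object SubexprDemo extends App {
  val shared = Add(Literal(1), Literal(2))
  val expr = Add(Add(shared, shared), Subquery(planId = 42L))
  // `shared` (and its literals) show up as common subexpressions;
  // the Subquery branch is never considered for deduplication.
  println(SubexprElimination.commonSubexpressions(expr))
}
{code}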






[jira] [Resolved] (SPARK-29181) Cache preferred locations of checkpointed RDD

2019-09-19 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-29181.
-
Resolution: Duplicate

> Cache preferred locations of checkpointed RDD
> -
>
> Key: SPARK-29181
> URL: https://issues.apache.org/jira/browse/SPARK-29181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> One Spark job in our cluster fits many ALS models in parallel. The fitting 
> goes well, but when we then union all the factors, the union operation is 
> very slow.
> Looking into the driver stack dump, it appears the driver spends a lot of 
> time computing preferred locations. Since we checkpoint the training data 
> before fitting ALS, the time is spent in 
> ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS 
> interface to query file status and block locations. Because a large number of 
> partitions are derived from the checkpointed RDD, the union spends a lot of 
> time querying the same information.
> This proposes adding a Spark config to control the caching behavior of 
> ReliableCheckpointRDD.getPreferredLocations. If it is enabled, 
> getPreferredLocations will compute preferred locations only once and cache 
> them for later use.






[jira] [Assigned] (SPARK-29182) Cache preferred locations of checkpointed RDD

2019-09-19 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-29182:
---

Assignee: Liang-Chi Hsieh

> Cache preferred locations of checkpointed RDD
> -
>
> Key: SPARK-29182
> URL: https://issues.apache.org/jira/browse/SPARK-29182
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> One Spark job in our cluster fits many ALS models in parallel. The fitting 
> goes well, but when we then union all the factors, the union operation is 
> very slow.
> Looking into the driver stack dump, it appears the driver spends a lot of 
> time computing preferred locations. Since we checkpoint the training data 
> before fitting ALS, the time is spent in 
> ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS 
> interface to query file status and block locations. Because a large number of 
> partitions are derived from the checkpointed RDD, the union spends a lot of 
> time querying the same information.
> This proposes adding a Spark config to control the caching behavior of 
> ReliableCheckpointRDD.getPreferredLocations. If it is enabled, 
> getPreferredLocations will compute preferred locations only once and cache 
> them for later use.






[jira] [Commented] (SPARK-29181) Cache preferred locations of checkpointed RDD

2019-09-19 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933955#comment-16933955
 ] 

Liang-Chi Hsieh commented on SPARK-29181:
-

[~dongjoon] Thanks. I was not aware that I had created a duplicate.

> Cache preferred locations of checkpointed RDD
> -
>
> Key: SPARK-29181
> URL: https://issues.apache.org/jira/browse/SPARK-29181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> One Spark job in our cluster fits many ALS models in parallel. The fitting 
> goes well, but when we then union all the factors, the union operation is 
> very slow.
> Looking into the driver stack dump, it appears the driver spends a lot of 
> time computing preferred locations. Since we checkpoint the training data 
> before fitting ALS, the time is spent in 
> ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS 
> interface to query file status and block locations. Because a large number of 
> partitions are derived from the checkpointed RDD, the union spends a lot of 
> time querying the same information.
> This proposes adding a Spark config to control the caching behavior of 
> ReliableCheckpointRDD.getPreferredLocations. If it is enabled, 
> getPreferredLocations will compute preferred locations only once and cache 
> them for later use.






[jira] [Created] (SPARK-29182) Cache preferred locations of checkpointed RDD

2019-09-19 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29182:
---

 Summary: Cache preferred locations of checkpointed RDD
 Key: SPARK-29182
 URL: https://issues.apache.org/jira/browse/SPARK-29182
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


One Spark job in our cluster fits many ALS models in parallel. The fitting goes 
well, but when we then union all the factors, the union operation is very slow.

Looking into the driver stack dump, it appears the driver spends a lot of time 
computing preferred locations. Since we checkpoint the training data before 
fitting ALS, the time is spent in ReliableCheckpointRDD.getPreferredLocations. 
This method calls the DFS interface to query file status and block locations. 
Because a large number of partitions are derived from the checkpointed RDD, the 
union spends a lot of time querying the same information.

This proposes adding a Spark config to control the caching behavior of 
ReliableCheckpointRDD.getPreferredLocations. If it is enabled, 
getPreferredLocations will compute preferred locations only once and cache them 
for later use.








[jira] [Created] (SPARK-29181) Cache preferred locations of checkpointed RDD

2019-09-19 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29181:
---

 Summary: Cache preferred locations of checkpointed RDD
 Key: SPARK-29181
 URL: https://issues.apache.org/jira/browse/SPARK-29181
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


One Spark job in our cluster fits many ALS models in parallel. The fitting goes 
well, but when we then union all the factors, the union operation is very slow.

Looking into the driver stack dump, it appears the driver spends a lot of time 
computing preferred locations. Since we checkpoint the training data before 
fitting ALS, the time is spent in ReliableCheckpointRDD.getPreferredLocations. 
This method calls the DFS interface to query file status and block locations. 
Because a large number of partitions are derived from the checkpointed RDD, the 
union spends a lot of time querying the same information.

This proposes adding a Spark config to control the caching behavior of 
ReliableCheckpointRDD.getPreferredLocations. If it is enabled, 
getPreferredLocations will compute preferred locations only once and cache them 
for later use.
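A minimal sketch of the caching idea, outside Spark: the first lookup per partition queries the DFS, later lookups reuse the cached answer when the flag is on. The config name mentioned in the comment and the queryDfs callback are illustrative assumptions, not Spark's actual API.

{code:scala}
import scala.collection.concurrent.TrieMap

// cacheEnabled stands in for a hypothetical config such as
// "spark.checkpoint.cachePreferredLocs" (name assumed for illustration).
class CheckpointLocations(queryDfs: Int => Seq[String], cacheEnabled: Boolean) {
  // partition index -> preferred hosts, filled on first lookup when caching is on
  private val cache = TrieMap.empty[Int, Seq[String]]

  def getPreferredLocations(partition: Int): Seq[String] =
    if (cacheEnabled) cache.getOrElseUpdate(partition, queryDfs(partition))
    else queryDfs(partition) // original behavior: query the DFS on every call
}

object LocationsDemo extends App {
  var dfsCalls = 0
  val locs = new CheckpointLocations(
    queryDfs = p => { dfsCalls += 1; Seq(s"host-${p % 3}") },
    cacheEnabled = true)

  (1 to 5).foreach(_ => locs.getPreferredLocations(0))
  println(s"DFS queried $dfsCalls time(s)") // 1 with caching on, 5 with it off
}
{code}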








[jira] [Commented] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-18 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932784#comment-16932784
 ] 

Liang-Chi Hsieh commented on SPARK-29042:
-

[~hyukjin.kwon] Did I set the fix versions and affects version correctly after 
the backport? Can you take a look? Thanks.

 

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> We have found and fixed correctness issues when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input. A sampling-based RDD with unordered input 
> should be INDETERMINATE.






[jira] [Updated] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-18 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-29042:

Fix Version/s: 2.4.5

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> We have found and fixed correctness issues when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input. A sampling-based RDD with unordered input 
> should be INDETERMINATE.






[jira] [Resolved] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2019-09-18 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-22796.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25812
[https://github.com/apache/spark/pull/25812]

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2019-09-18 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-22796:
---

Assignee: Huaxin Gao

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
>







[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930827#comment-16930827
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Regarding the unstable AUC issue: nondeterministic training data, even when it 
does not cause an ArrayIndexOutOfBoundsException, can also cause incorrect 
matching when computing factors. I don't have evidence that this is the reason, 
but it is possible.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes.  And we also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 
> which was not stable for production environment. 
> Dataset capacity: ~12 billion ratings
>  Here is our code:
> {code:java}
> val hivedata = sc.sql(sqltext).select("id", "dpid", "score", "tag")
> 

[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930787#comment-16930787
 ] 

Liang-Chi Hsieh commented on SPARK-26205:
-

[~cloud_fan] I see now. Created SPARK-29100.

> Optimize InSet expression for bytes, shorts, ints, dates
> 
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.0.0
>
>
> {{In}} expressions are compiled into a sequence of if-else statements, which 
> results in O\(n\) time complexity. {{InSet}} is an optimized version of 
> {{In}}, which is supposed to improve the performance if the number of 
> elements is big enough. However, {{InSet}} actually degrades the performance 
> in many cases due to various reasons (benchmarks were created in SPARK-26203 
> and solutions to the boxing problem are discussed in SPARK-26204).
> The main idea of this JIRA is to use Java {{switch}} statements to 
> significantly improve the performance of {{InSet}} expressions for bytes, 
> shorts, ints, dates. All {{switch}} statements are compiled into 
> {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have 
> O\(1\) time complexity if our case values are compact and {{tableswitch}} can 
> be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local 
> benchmarks show that this logic is more than two times faster even on 500+ 
> elements than using primitive collections in {{InSet}} expressions. As Spark 
> is using Scala {{HashSet}} right now, the performance gain will be even 
> bigger.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.
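A hedged sketch of the core idea (plain Scala, not Spark's generated code): an equality chain checks the candidates one by one, while a match over constant integers is typically compiled to a tableswitch/lookupswitch instruction. The value set {1, 3, 5, 9} is just an example.

{code:scala}
import scala.annotation.switch

object InSetSketch {
  // O(n): roughly what a chain of if-else / OR comparisons does per row
  def inLinear(v: Int): Boolean =
    v == 1 || v == 3 || v == 5 || v == 9

  // Compiled to a switch instruction when the cases are constant ints;
  // @switch only asks the compiler to warn if it cannot emit one.
  def inSwitch(v: Int): Boolean = (v: @switch) match {
    case 1 => true
    case 3 => true
    case 5 => true
    case 9 => true
    case _ => false
  }
}

object InSetDemo extends App {
  assert(InSetSketch.inLinear(5) == InSetSketch.inSwitch(5))
  assert(InSetSketch.inLinear(4) == InSetSketch.inSwitch(4))
  println("both checks agree")
}
{code}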






[jira] [Assigned] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-29100:
---

Assignee: Liang-Chi Hsieh

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> statement for certain cases. When the given set is empty, codegen can cause 
> a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)






[jira] [Updated] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-29100:

Description: 
SPARK-26205 adds an optimization to InSet that generates a Java switch 
statement for certain cases. When the given set is empty, codegen can cause a 
compilation error:

 

[info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
milliseconds)                                      

[info]   Code generation of input[0, int, true] INSET () failed:                
                                                        

[info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
"generated.java": Compiling "apply(java.lang.Object _i)"; 
apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
size 0, now 1                                                                   
                                        

[info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
"generated.java": Compiling "apply(java.lang.Object _i)"; 
apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
size 0, now 1                                                                   
                                        

[info]         at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
                                                                                
        

[info]         at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
               

[info]         at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)

  was:SPARK-26205 adds an optimization to InSet that generates a Java switch 
statement for certain cases. When the given set is empty, codegen can cause a 
compilation error.


> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> statement for certain cases. When the given set is empty, codegen can cause 
> a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)






[jira] [Created] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29100:
---

 Summary: Codegen with switch in InSet expression causes 
compilation error
 Key: SPARK-29100
 URL: https://issues.apache.org/jira/browse/SPARK-29100
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


SPARK-26205 adds an optimization to InSet that generates a Java switch 
statement for certain cases. When the given set is empty, codegen can cause a 
compilation error.
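One possible shape of a guard for the empty-set case, sketched as a toy string-template generator (the method and variable names are hypothetical; this is not Spark's CodeGenerator API): an empty switch body is what trips the Java compiler, so the generator falls back to a constant result instead.

{code:scala}
object SwitchCodegenSketch {
  def genInSetSwitch(input: String, values: Set[Int], resultVar: String): String =
    if (values.isEmpty) {
      // Nothing can match an empty set; emit a constant instead of `switch () {}`.
      s"boolean $resultVar = false;"
    } else {
      val cases = values.toSeq.sorted.map(v => s"case $v:").mkString(" ")
      s"""boolean $resultVar;
         |switch ($input) {
         |  $cases $resultVar = true; break;
         |  default: $resultVar = false;
         |}""".stripMargin
    }
}

object SwitchCodegenDemo extends App {
  println(SwitchCodegenSketch.genInSetSwitch("value", Set.empty, "isIn"))
  println(SwitchCodegenSketch.genInSetSwitch("value", Set(1, 3, 9), "isIn"))
}
{code}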






[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930658#comment-16930658
 ] 

Liang-Chi Hsieh edited comment on SPARK-28927 at 9/16/19 3:36 PM:
--

Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition, which 
fixes the shuffle + repartition issue on DataFrames, was only introduced as of 
2.4. Thus the nondeterministic behavior can also be triggered when a 
repartition call follows a shuffle in Spark 2.2.1. I noticed that you call 
repartition after reading data with a SQL query. Maybe your sqltext has a 
shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.


was (Author: viirya):
Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition, which 
fixes the shuffle + repartition issue on DataFrames, was only introduced as of 
2.4. The nondeterministic behavior can be triggered when a repartition call 
follows a shuffle. I noticed that you call repartition after reading data with 
a SQL query. Maybe your sqltext has a shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> 

[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930658#comment-16930658
 ] 

Liang-Chi Hsieh edited comment on SPARK-28927 at 9/16/19 3:35 PM:
--

Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition, which 
fixes the shuffle + repartition issue on DataFrames, was only introduced as of 
2.4. The nondeterministic behavior can be triggered when a repartition call 
follows a shuffle. I noticed that you call repartition after reading data with 
a SQL query. Maybe your sqltext has a shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.


was (Author: viirya):
Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition was 
introduced to fix the shuffle + repartition issue on DataFrames. The 
nondeterministic behavior can be triggered when a repartition call follows a 
shuffle. I noticed that you call repartition after reading data with a SQL 
query. Maybe your sqltext has a shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> 

[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930658#comment-16930658
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition was 
introduced to fix the shuffle + repartition issue on DataFrames. The 
nondeterministic behavior can be triggered when a repartition call follows a 
shuffle. I noticed that you call repartition after reading data with a SQL 
query. Maybe your sqltext has a shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.
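A rough sketch of the suggested workaround, assuming Spark 2.x with spark-mllib on the classpath. The table name, checkpoint path, partition count and column names are placeholders, not taken from the reporter's job.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

object CheckpointBeforeAls extends App {
  val spark = SparkSession.builder().appName("als-checkpoint").getOrCreate()
  spark.sparkContext.setCheckpointDir("hdfs:///tmp/als-checkpoint") // assumed path

  val raw = spark.read.table("training_ratings") // assumed source table
  // Eager, reliable checkpoint: cuts the lineage that contained the
  // shuffle + repartition, so stage retries see the same rows.
  val training = raw.repartition(2000).checkpoint()

  val als = new ALS()
    .setUserCol("user")
    .setItemCol("item")
    .setRatingCol("rating")
  val model = als.fit(training)

  spark.stop()
}
{code}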

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes.  And we also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 
> which was not stable for 

[jira] [Assigned] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-14 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-28927:
---

Assignee: Liang-Chi Hsieh

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes.  And we also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 
> which was not stable for production environment. 
> Dataset capacity: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends Serializable
> val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
> val als = new ALS
> val paramMap = 

[jira] [Assigned] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-13 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-29042:
---

Assignee: Liang-Chi Hsieh

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
>
> We have found and fixed correctness issues when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input. A sampling-based RDD with unordered input 
> should be INDETERMINATE.






[jira] [Resolved] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-13 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-29042.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25751
[https://github.com/apache/spark/pull/25751]

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> We have found and fixed correctness issues when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input. A sampling-based RDD with unordered input 
> should be INDETERMINATE.






[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates

2019-09-13 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929334#comment-16929334
 ] 

Liang-Chi Hsieh commented on SPARK-26205:
-

[~cloud_fan] I ran a simple test, and it seems no failure happens?

val inSet = InSet(Literal(0), Set.empty)
checkEvaluation(inSet, false, row1)

I verified that it is code-generated with genCodeWithSwitch.

> Optimize InSet expression for bytes, shorts, ints, dates
> 
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.0.0
>
>
> {{In}} expressions are compiled into a sequence of if-else statements, which 
> results in O\(n\) time complexity. {{InSet}} is an optimized version of 
> {{In}}, which is supposed to improve the performance if the number of 
> elements is big enough. However, {{InSet}} actually degrades the performance 
> in many cases due to various reasons (benchmarks were created in SPARK-26203 
> and solutions to the boxing problem are discussed in SPARK-26204).
> The main idea of this JIRA is to use Java {{switch}} statements to 
> significantly improve the performance of {{InSet}} expressions for bytes, 
> shorts, ints, dates. All {{switch}} statements are compiled into 
> {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have 
> O\(1\) time complexity if our case values are compact and {{tableswitch}} can 
> be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local 
> benchmarks show that this logic is more than two times faster even on 500+ 
> elements than using primitive collections in {{InSet}} expressions. As Spark 
> is using Scala {{HashSet}} right now, the performance gain will be even 
> bigger.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.






[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates

2019-09-12 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928649#comment-16928649
 ] 

Liang-Chi Hsieh commented on SPARK-26205:
-

Yeah, I will look at it.

> Optimize InSet expression for bytes, shorts, ints, dates
> 
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.0.0
>
>
> {{In}} expressions are compiled into a sequence of if-else statements, which 
> results in O\(n\) time complexity. {{InSet}} is an optimized version of 
> {{In}}, which is supposed to improve the performance if the number of 
> elements is big enough. However, {{InSet}} actually degrades the performance 
> in many cases due to various reasons (benchmarks were created in SPARK-26203 
> and solutions to the boxing problem are discussed in SPARK-26204).
> The main idea of this JIRA is to use Java {{switch}} statements to 
> significantly improve the performance of {{InSet}} expressions for bytes, 
> shorts, ints, dates. All {{switch}} statements are compiled into 
> {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have 
> O\(1\) time complexity if our case values are compact and {{tableswitch}} can 
> be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local 
> benchmarks show that this logic is more than two times faster even on 500+ 
> elements than using primitive collections in {{InSet}} expressions. As Spark 
> is using Scala {{HashSet}} right now, the performance gain will be even 
> bigger.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-10 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29042:
---

 Summary: Sampling-based RDD with unordered input should be 
INDETERMINATE
 Key: SPARK-29042
 URL: https://issues.apache.org/jira/browse/SPARK-29042
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We have found and fixed correctness issues when RDD output is INDETERMINATE. One 
missing part is sampling-based RDDs. This kind of RDD is sensitive to the order of 
its input, so a sampling-based RDD with unordered input should itself be 
INDETERMINATE.
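
To illustrate the order sensitivity, here is a hypothetical standalone sketch (not
the fix itself): {{sample}} draws elements by their position within each partition,
so if the parent's output order can change on recomputation, the sampled rows can
change too, even with a fixed seed.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: an order-unstable parent (e.g. a shuffle whose fetch order
// can differ when a stage is recomputed) can yield a different sample under the
// same seed, because sampling is positional within each partition.
object SampleOrderSensitivitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sample-order").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val unordered = sc.parallelize(1 to 100000, 8)
      .map(i => (i % 13, i))
      .repartition(4) // row order within the output partitions is not guaranteed
      .values

    // Same fixed seed; the results only agree if the input order is stable.
    val s1 = unordered.sample(withReplacement = false, fraction = 0.01, seed = 42L).collect().toSet
    val s2 = unordered.sample(withReplacement = false, fraction = 0.01, seed = 42L).collect().toSet
    println(s1 == s2) // may be false if the parent's output order changed between computations

    spark.stop()
  }
}
{code}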



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-10 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926814#comment-16926814
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Hi [~JerryHouse], do you use any non-deterministic operations when preparing 
your training dataset, like sampling or filtering based on random numbers, etc.?

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment. 
> Dataset capacity: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends Serializable
> 

[jira] [Assigned] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2019-09-09 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-23265:
---

Assignee: Huaxin Gao

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2019-09-09 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-23265.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 20442
[https://github.com/apache/spark/pull/20442]

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29013) Structurally equivalent subexpression elimination

2019-09-06 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29013:
---

 Summary: Structurally equivalent subexpression elimination
 Key: SPARK-29013
 URL: https://issues.apache.org/jira/browse/SPARK-29013
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We do semantically equivalent subexpression elimination in Spark SQL. However, for 
some expressions that are not semantically equivalent but are structurally 
equivalent, the current subexpression elimination generates too many similar 
functions. These functions share the same computation structure and differ only in 
the input slots of the current processing row.

For such expressions, we can generate just one function and pass in the input 
slots at runtime.

This can reduce the length of the generated code text and save compilation time.
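
As a rough sketch of the idea (all names below are hypothetical, not Spark's
actual codegen API): instead of emitting one generated function per occurrence,
e.g. one reading input slot 0 and another reading input slot 1, we can emit a
single shared body and pass the input slot (ordinal) in at runtime.

{code:scala}
// Hypothetical sketch of the proposal; names and types are made up for illustration.
// Structurally equivalent occurrences share one compiled body, and only the
// ordinal of the input slot is supplied at evaluation time.
final class SharedStructuralSubexpr(f: Long => Long) {
  def eval(row: Array[Long], ordinal: Int): Long = f(row(ordinal))
}

object SharedStructuralSubexprDemo {
  def main(args: Array[String]): Unit = {
    val shared = new SharedStructuralSubexpr(x => x * x + 1)
    val row = Array(3L, 7L)
    println(shared.eval(row, 0)) // 10: same body, reading input slot 0
    println(shared.eval(row, 1)) // 50: same body, reading input slot 1
  }
}
{code}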

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28933:

Fix Version/s: 3.0.0

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}}, so we can preserve the existing partitioning of the RDD of 
> {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, and 
> we don't change the partitioning when initializing factors.
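
Below is a minimal sketch of the difference with a stand-in pair RDD (not the
actual ALS {{InBlock}} types): {{map}} discards the partitioner, while
{{mapPartitions}} with {{preservesPartitioning = true}} keeps it, so downstream
operations keyed by the same block id avoid another shuffle.

{code:scala}
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Minimal sketch with a stand-in pair RDD (not ALS's InBlock types).
object PreservePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("preserve-partitioning").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val inBlocks = sc.parallelize(Seq((0, "blockA"), (1, "blockB"), (2, "blockC")))
      .partitionBy(new HashPartitioner(2))

    // `map` drops the partitioner, even though the keys are unchanged.
    val viaMap = inBlocks.map { case (srcBlockId, block) => (srcBlockId, block.length) }

    // `mapPartitions` can declare that the partitioning is preserved.
    val viaMapPartitions = inBlocks.mapPartitions(
      _.map { case (srcBlockId, block) => (srcBlockId, block.length) },
      preservesPartitioning = true)

    println(viaMap.partitioner)           // None: keying by block id again would reshuffle
    println(viaMapPartitions.partitioner) // Some(HashPartitioner(2)): shuffle avoided

    spark.stop()
  }
}
{code}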



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920584#comment-16920584
 ] 

Liang-Chi Hsieh commented on SPARK-28933:
-

This issue was resolved by [https://github.com/apache/spark/pull/25639].

 

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}}, so we can preserve the existing partitioning of the RDD of 
> {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, and 
> we don't change the partitioning when initializing factors.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-28933.
-
Resolution: Resolved

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}}, so we can preserve the existing partitioning of the RDD of 
> {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, and 
> we don't change the partitioning when initializing factors.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan

2019-09-01 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920550#comment-16920550
 ] 

Liang-Chi Hsieh commented on SPARK-28935:
-

Thanks! [~smilegator]

It should be helpful.

> Document SQL metrics for Details for Query Plan
> ---
>
> Key: SPARK-28935
> URL: https://issues.apache.org/jira/browse/SPARK-28935
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> [https://github.com/apache/spark/pull/25349] shows the query plans but it 
> does not describe the meaning of each metric in the plan. For end users, they 
> might not understand the meaning of the metrics we output. 
>  
> !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-01 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920480#comment-16920480
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Does this only happen on 2.2.1? How about current master branch?

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment. 
> Dataset capacity: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends Serializable
> val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
> val als = new ALS
> val paramMap = ParamMap(als.alpha -> 25000).
>

[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-31 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-23519:

Component/s: (was: Spark Core)

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Assignee: hemanth meka
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-31 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920302#comment-16920302
 ] 

Liang-Chi Hsieh commented on SPARK-23519:
-

This was closed and then reopened and fixed. The label 
[bulk-closed|https://issues.apache.org/jira/issues/?jql=labels+%3D+bulk-closed] 
does not look correct, so I removed it. Feel free to add it back if I 
misunderstood.

 

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Assignee: hemanth meka
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.0.0
>
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-31 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-23519:

Labels:   (was: bulk-closed)

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Assignee: hemanth meka
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan

2019-08-30 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919975#comment-16919975
 ] 

Liang-Chi Hsieh commented on SPARK-28935:
-

Thanks for pinging me! I will look into this.

> Document SQL metrics for Details for Query Plan
> ---
>
> Key: SPARK-28935
> URL: https://issues.apache.org/jira/browse/SPARK-28935
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> [https://github.com/apache/spark/pull/25349] shows the query plans but it 
> does not describe the meaning of each metric in the plan. For end users, they 
> might not understand the meaning of the metrics we output. 
>  
> !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28926) CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-08-30 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-28926.
-
Resolution: Duplicate

I think this is a duplicate of SPARK-28927.

> CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for 
> datasets  with 12 billion instances
> 
>
> Key: SPARK-28926
> URL: https://issues.apache.org/jira/browse/SPARK-28926
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Xiangrui Meng
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment. 
> Dataset capacity: ~12 billion ratings
> Here is our code:
> {code:java}
> val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions)
> val predataItem =  hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum)))
>   .groupByKey().zipWithIndex()
>   .persist(StorageLevel.MEMORY_AND_DISK_SER)
> val predataUser = 
> predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2
>   

[jira] [Assigned] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-08-30 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-28933:
---

Assignee: Liang-Chi Hsieh

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}}, so we can preserve the existing partitioning of the RDD of 
> {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, and 
> we don't change the partitioning when initializing factors.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-08-30 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-28933:
---

 Summary: Reduce unnecessary shuffle in ALS when initializing 
factors
 Key: SPARK-28933
 URL: https://issues.apache.org/jira/browse/SPARK-28933
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


When initializing factors in ALS, we should use {{mapPartitions}} instead of the 
current {{map}}, so we can preserve the existing partitioning of the RDD of 
{{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, and we 
don't change the partitioning when initializing factors.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28920) Set up java version for github workflow

2019-08-29 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-28920:
---

 Summary: Set up java version for github workflow
 Key: SPARK-28920
 URL: https://issues.apache.org/jira/browse/SPARK-28920
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We added a Java version matrix to the GitHub workflow. As we want to build with 
JDK 8/11, we should set up the Java version for mvn.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-26 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915809#comment-16915809
 ] 

Liang-Chi Hsieh commented on SPARK-23519:
-

I tested with Hive 2.1. It doesn't support duplicate column names:
{code:java}
hive> create view test_view (c1, c2) as select c1, c1 from test;
FAILED: SemanticException [Error 10036]: Duplicate column name: c1
{code}
[~tafra...@gmail.com] you said Hive supports it; do newer versions of Hive 
support this?

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25549) High level API to collect RDD statistics

2019-08-25 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-25549.
-
Resolution: Won't Fix

> High level API to collect RDD statistics
> 
>
> Key: SPARK-25549
> URL: https://issues.apache.org/jira/browse/SPARK-25549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have the low-level API SparkContext.submitMapStage that can be used for 
> collecting statistics of an RDD. However, it is too low level and not easy to 
> use. We need a high-level API for that.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics

2019-08-25 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915362#comment-16915362
 ] 

Liang-Chi Hsieh commented on SPARK-25549:
-

Closing this as it is not needed now.

> High level API to collect RDD statistics
> 
>
> Key: SPARK-25549
> URL: https://issues.apache.org/jira/browse/SPARK-25549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have the low-level API SparkContext.submitMapStage that can be used for 
> collecting statistics of an RDD. However, it is too low level and not easy to 
> use. We need a high-level API for that.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28866) Persist item factors RDD when checkpointing in ALS

2019-08-25 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-28866:
---

 Summary: Persist item factors RDD when checkpointing in ALS
 Key: SPARK-28866
 URL: https://issues.apache.org/jira/browse/SPARK-28866
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


In the ALS ML implementation, if `implicitPrefs` is false, we checkpoint the RDD 
of item factors at intervals. This RDD is not persisted before it is checkpointed 
and materialized, which causes recomputation. In an experiment, there is a 
performance difference between persisting and not persisting before checkpointing 
the RDD.
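
A minimal sketch with a stand-in RDD (not the actual ALS factor RDDs): persisting
before the checkpoint lets the separate checkpoint job read the cached partitions
instead of recomputing the lineage a second time. The checkpoint directory below
is a placeholder.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: without persist, the first action and the checkpoint job each
// recompute the lineage; persisting first lets the checkpoint reuse cached data.
object PersistBeforeCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-before-checkpoint").master("local[2]").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder scratch directory

    // Stand-in for the item factors RDD.
    val itemFactors = sc.parallelize(1 to 1000000, 8).map(i => (i, Array.fill(10)(i * 0.001)))

    itemFactors.persist(StorageLevel.MEMORY_AND_DISK) // avoids recomputation below
    itemFactors.checkpoint()
    itemFactors.count() // materializes both the cache and the checkpoint

    spark.stop()
  }
}
{code}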



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-08-24 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914950#comment-16914950
 ] 

Liang-Chi Hsieh commented on SPARK-24666:
-

I tried to run Word2Vec with the Quora Question Pairs dataset and set the max 
iterations to 20, but I can't reproduce this.
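
Roughly the kind of script used for the attempt, as a sketch (the input path,
tokenization, and parameters other than {{maxIter}} are placeholders, not the
exact code):

{code:scala}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

// Sketch of the reproduction attempt; path and column names are placeholders.
object Word2VecReproSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("w2v-repro").master("local[4]").getOrCreate()
    import spark.implicits._

    val sentences = spark.read.textFile("/path/to/quora_questions.txt") // placeholder path
      .map(_.toLowerCase.split("\\s+").toSeq)
      .toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("vector")
      .setVectorSize(100)
      .setMaxIter(20) // the large iteration count reported to trigger the issue
      .fit(sentences)

    // Inspect the learned vectors for infinities or blow-ups.
    model.getVectors.show(5, truncate = false)
    spark.stop()
  }
}
{code}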


> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates vectors with large absolute values when 
> numIterations is large, and if numIterations is large enough (>20), the vector 
> values may be *infinity* (or *-infinity*), resulting in useless vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is shown on spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.
> There are already issues report this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the bug fix works 
> seems missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-22 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913915#comment-16913915
 ] 

Liang-Chi Hsieh commented on SPARK-23519:
-

Thanks for pinging me.

I am going on a flight soon. If this is not urgent, I can look into it after 
today.

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28672) [UDF] Duplicate function creation should not allow

2019-08-19 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911007#comment-16911007
 ] 

Liang-Chi Hsieh commented on SPARK-28672:
-

Is there any rule in Hive regarding this? For example, disallowing duplicate 
permanent/temporary functions, or resolving the temporary/permanent function 
first when the names are duplicated?

> [UDF] Duplicate function creation should not allow 
> ---
>
> Key: SPARK-28672
> URL: https://issues.apache.org/jira/browse/SPARK-28672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create function addm_3  AS 
> 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.084 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create temporary function addm_3  
> AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> INFO  : converting to local hdfs://hacluster/user/Multiply.jar
> INFO  : Added 
> [/tmp/8a396308-41f8-4335-9de4-8268ce5c70fe_resources/Multiply.jar] to class 
> path
> INFO  : Added resources: [hdfs://hacluster/user/Multiply.jar]
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.134 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> show functions like addm_3;
> +-+--+
> |function |
> +-+--+
> | addm_3  |
> | default.addm_3  |
> +-+--+
> 2 rows selected (0.047 seconds)
> {code}
> When show functions is executed, it lists both functions, but what about the 
> database for the permanent function when the user has not specified one?
> A duplicate should not be allowed if the user creates a temporary function with 
> the same name.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28761) spark.driver.maxResultSize only applies to compressed data

2019-08-16 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909420#comment-16909420
 ] 

Liang-Chi Hsieh commented on SPARK-28761:
-

If you do it at SparkPlan.scala#L344, isn't it just for SQL? 
{{spark.driver.maxResultSize}} covers RDD, right?

> spark.driver.maxResultSize only applies to compressed data
> --
>
> Key: SPARK-28761
> URL: https://issues.apache.org/jira/browse/SPARK-28761
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: David Vogelbacher
>Priority: Major
>
> Spark has a setting {{spark.driver.maxResultSize}}, see 
> https://spark.apache.org/docs/latest/configuration.html#application-properties
>  :
> {noformat}
> Limit of total size of serialized results of all partitions for each Spark 
> action (e.g. collect) in bytes. Should be at least 1M, or 0 for unlimited. 
> Jobs will be aborted if the total size is above this limit. Having a high 
> limit may cause out-of-memory errors in driver (depends on 
> spark.driver.memory and memory overhead of objects in JVM). 
> Setting a proper limit can protect the driver from out-of-memory errors.
> {noformat}
> This setting can be very useful in constraining the memory that the spark 
> driver needs for a specific spark action. However, this limit is checked 
> before decompressing data in 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L662
> Even if the compressed data is below the limit the uncompressed data can 
> still be far above. In order to protect the driver we should also impose a 
> limit on the uncompressed data. We could do this in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L344
> I propose adding a new config option 
> {{spark.driver.maxUncompressedResultSize}}.
> A simple repro of this with spark shell:
> {noformat}
> > printf 'a%.0s' {1..10} > test.csv # create a 100 MB file
> > ./bin/spark-shell --conf "spark.driver.maxResultSize=1"
> scala> val df = spark.read.format("csv").load("/Users/dvogelbacher/test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> scala> val results = df.collect()
> results: Array[org.apache.spark.sql.Row] = 
> Array([a...
> scala> results(0).getString(0).size
> res0: Int = 10
> {noformat}
> Even though we set maxResultSize to 10 MB, we collect a result that is 100MB 
> uncompressed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing

2019-08-16 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909409#comment-16909409
 ] 

Liang-Chi Hsieh commented on SPARK-28732:
-

As {{count}}'s return type is LongType, I think it is reasonable that it can't 
fit into an Int column. The problem here might be that the error is not friendly.

Normally, if we want to map a dataset to a specified type and the type is 
incompatible, an exception like this should be thrown:
{code}
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;   
 
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2801)
  
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2821)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2812)

{code}
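
For reference, a minimal Scala sketch of the two workarounds implied above
(illustrative class, column names, and data, not the reporter's Java code): either
model the count as {{Long}}, which is its natural type, or add an explicit cast to
int before mapping to the target class.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

// Illustrative case classes; only countN's type differs.
case class NationCountL(n_name: String, countN: Long) // preferred: keep LongType
case class NationCountI(n_name: String, countN: Int)  // needs an explicit downcast

object CountTypeWorkaroundsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("count-type").master("local[2]").getOrCreate()
    import spark.implicits._

    val df = Seq(("FRANCE", 1995), ("FRANCE", 1996), ("GERMANY", 1995)).toDF("n_name", "o_year")
    val counted = df.groupBy("n_name").agg(count("n_name").as("countN"))

    val asLong = counted.as[NationCountL] // works as-is: count is a bigint
    val asInt = counted // without the cast, .as[NationCountI] fails the up-cast check
      .withColumn("countN", col("countN").cast("int"))
      .as[NationCountI]

    asLong.show()
    asInt.show()
    spark.stop()
  }
}
{code}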

> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java' when storing the result of a count aggregation in an integer
> ---
>
> Key: SPARK-28732
> URL: https://issues.apache.org/jira/browse/SPARK-28732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Alix Métivier
>Priority: Major
>
> I am using the agg function on a dataset, and I want to count the number of 
> lines for each combination of grouping columns. I would like to store the result 
> of this count in an integer, but it fails with this output: 
> {code}
> [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 89, Column 53: No applicable constructor/method found 
> for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
> "java.lang.Integer(java.lang.String)"
> Here is the line 89 and a few others to understand :
> /* 085 */ long value13 = i.getLong(5);
>  /* 086 */ argValue4 = value13;
>  /* 087 */
>  /* 088 */
>  /* 089 */ final java.lang.Integer value12 = false ? null : new 
> java.lang.Integer(argValue4);
> {code}
>  
> As per the Integer documentation, there is no constructor for the type Long, 
> which is why the generated code fails.
> Here is my code : 
> {code}
> org.apache.spark.sql.Dataset ds_row2 = 
> ds_conntAggregateRow_1_Out_1
>  .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
>  org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
>  .agg(org.apache.spark.sql.functions.count("n_name").as("countN"),
>  .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));
> {code}
> row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int
> If countN is a Long, code above wont fail
> If it is an Int, it works in 1.6 and 2.0, but fails on version 2.1+
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when st

2019-08-16 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909409#comment-16909409
 ] 

Liang-Chi Hsieh edited comment on SPARK-28732 at 8/16/19 9:19 PM:
--

As {{count}}'s return type is LongType, I think it is reasonable that it can't 
fit into an Int column. The problem here might be that the error is not friendly.

Normally, if we want to map a dataset to a specified type and the type is 
incompatible, an exception like this should be thrown:
{code}
org.apache.spark.sql.AnalysisException: Cannot up cast `b` from bigint to int.  

  
The type path of the target object is:  

  
- field (class: "scala.Int", name: "b") 
   
- root class: "Test" 
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;   
 
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2801)
  
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2821)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2812)

{code}


was (Author: viirya):
As {{count}} return type is LongType, I think it is reasonable that it can't be 
fit into an Int column. The problem here might be the error is not friendly.

Normally, if we want to map dataset to specified type, an exception like this 
should be thrown, if it is incompatible:
{code}
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;   
 
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2801)
  
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2821)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2812)

{code}

> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java' when storing the result of a count aggregation in an integer
> ---
>
> Key: SPARK-28732
> URL: https://issues.apache.org/jira/browse/SPARK-28732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Alix Métivier
>Priority: Major
>
> I am using the agg function on a dataset, and I want to count the number of 
> lines for each combination of grouping columns. I would like to store the result 
> of this count in an integer, but it fails with this output: 
> {code}
> [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 89, Column 53: No applicable constructor/method found 
> for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
> "java.lang.Integer(java.lang.String)"
> Here is the line 89 and a few others to understand :
> /* 085 */ long value13 = i.getLong(5);
>  /* 086 */ argValue4 = value13;
>  /* 087 */
>  /* 088 */
>  /* 089 */ final java.lang.Integer value12 = false ? null : new 
> java.lang.Integer(argValue4);
> {code}
>  
> As per the Integer documentation, there is no constructor for the type long, 
> which is why the generated code fails.
> Here is my code:
> {code}
> org.apache.spark.sql.Dataset 

[jira] [Created] (SPARK-28722) Change sequential label sorting in StringIndexer fit to parallel

2019-08-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28722:
---

 Summary: Change sequential label sorting in StringIndexer fit to 
parallel
 Key: SPARK-28722
 URL: https://issues.apache.org/jira/browse/SPARK-28722
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


The fit method in StringIndexer sorts the given labels sequentially when there are 
multiple input columns. As the number of input columns increases, the label sorting 
time increases dramatically as well, so it is hard to use in practice when dealing 
with hundreds of input columns.
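
A rough Python sketch of the idea only (this is not Spark's ML code): sort the 
label sets of the input columns concurrently instead of one column after another.

{code:python}
# Sketch only: per-column label sorting done in parallel instead of sequentially.
from concurrent.futures import ThreadPoolExecutor

def sort_labels(labels):
    return sorted(labels)

# one label list per input column (made-up data)
label_sets = [["b", "a", "c"], ["z", "y"], ["n", "k", "m"]]

sorted_sequentially = [sort_labels(ls) for ls in label_sets]

with ThreadPoolExecutor() as pool:
    sorted_in_parallel = list(pool.map(sort_labels, label_sets))

assert sorted_sequentially == sorted_in_parallel
{code}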





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors

2019-08-11 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28652:

Priority: Minor  (was: Major)

> spark.kubernetes.pyspark.pythonVersion is never passed to executors
> ---
>
> Key: SPARK-28652
> URL: https://issues.apache.org/jira/browse/SPARK-28652
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: nanav yorbiz
>Priority: Minor
>
> I suppose this may not be a priority with Python 2 on its way out, but given 
> that this setting is only ever sent to the driver and not the executors, no 
> actual work can be performed when the versions don't match. That will tend to 
> be *always* the case, with the default setting for the driver changed from 2 
> to 3 and the executors using `python`, which defaults to version 2.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors

2019-08-11 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28652:

Issue Type: Test  (was: Bug)

> spark.kubernetes.pyspark.pythonVersion is never passed to executors
> ---
>
> Key: SPARK-28652
> URL: https://issues.apache.org/jira/browse/SPARK-28652
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: nanav yorbiz
>Priority: Major
>
> I suppose this may not be a priority with Python 2 on its way out, but given 
> that this setting is only ever sent to the driver and not the executors, no 
> actual work can be performed when the versions don't match. That will tend to 
> be *always* the case, with the default setting for the driver changed from 2 
> to 3 and the executors using `python`, which defaults to version 2.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors

2019-08-11 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904743#comment-16904743
 ] 

Liang-Chi Hsieh commented on SPARK-28652:
-

As existing tests don't explicitly check the Python version on the executor side, I 
patched an existing test to check the Python version there.
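
For reference, a minimal PySpark sketch (not the actual integration test code) of 
what such a check can look like, assuming a running SparkSession:

{code:python}
# Illustrative sketch only: compare the Python major version seen on the
# driver with the versions reported by executor-side tasks.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

driver_major = sys.version_info[0]
executor_majors = (
    sc.range(0, 8, numSlices=4)
      .map(lambda _: sys.version_info[0])
      .distinct()
      .collect()
)
assert executor_majors == [driver_major], (driver_major, executor_majors)
{code}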

> spark.kubernetes.pyspark.pythonVersion is never passed to executors
> ---
>
> Key: SPARK-28652
> URL: https://issues.apache.org/jira/browse/SPARK-28652
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: nanav yorbiz
>Priority: Major
>
> I suppose this may not be a priority with Python 2 on its way out, but given 
> that this setting is only ever sent to the driver and not the executors, no 
> actual work can be performed when the versions don't match. That will tend to 
> be *always* the case, with the default setting for the driver changed from 2 
> to 3 and the executors using `python`, which defaults to version 2.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors

2019-08-11 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904729#comment-16904729
 ] 

Liang-Chi Hsieh commented on SPARK-28652:
-

This looks interesting to me. I tried to look into the existing tests. I think it 
is true that {{spark.kubernetes.pyspark.pythonVersion}} is not passed to 
executors. But that looks correct, and I think we don't need to pass it.

The Python version used by executors comes from the Python side on the driver, 
when wrapping a Python function. PythonRunner later serializes this variable 
when it is going to invoke Python workers. PythonWorkerFactory also uses this 
variable to determine which Python executable to run. So on executors, the 
Python executable to run is not determined by PYSPARK_PYTHON. It means that we 
don't need to pass spark.kubernetes.pyspark.pythonVersion to executors, as this 
config is only used to choose PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON.

cc [~hyukjin.kwon] too, in case I missed something.


> spark.kubernetes.pyspark.pythonVersion is never passed to executors
> ---
>
> Key: SPARK-28652
> URL: https://issues.apache.org/jira/browse/SPARK-28652
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: nanav yorbiz
>Priority: Major
>
> I suppose this may not be a priority with Python 2 on its way out, but given 
> that this setting is only ever sent to the driver and not the executors, no 
> actual work can be performed when the versions don't match. That will tend to 
> be *always* the case, with the default setting for the driver changed from 2 
> to 3 and the executors using `python`, which defaults to version 2.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899788#comment-16899788
 ] 

Liang-Chi Hsieh commented on SPARK-28422:
-

Thanks [~dongjoon]!


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2019-07-21 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889742#comment-16889742
 ] 

Liang-Chi Hsieh commented on SPARK-24152:
-

Ok. I think it was fixed.

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we already started to merge the PR by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2019-07-21 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889671#comment-16889671
 ] 

Liang-Chi Hsieh commented on SPARK-24152:
-

This CRAN issue is happening again now. I emailed the CRAN admins for help and will 
update after they reply.


> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we already started to merge the PR by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes

2019-07-19 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28441:

Summary: PythonUDF used in correlated scalar subquery causes   (was: 
udf(max(udf(column))) throws java.lang.UnsupportedOperationException: Cannot 
evaluate expression: udf(null))

> PythonUDF used in correlated scalar subquery causes 
> 
>
> Key: SPARK-28441
> URL: https://issues.apache.org/jira/browse/SPARK-28441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I found this when doing https://issues.apache.org/jira/browse/SPARK-28277
>  
> {code:java}
> >>> @pandas_udf("string", PandasUDFType.SCALAR)
> ... def noop(x):
> ...     return x.apply(str)
> ... 
> >>> spark.udf.register("udf", noop)
> 
> >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values 
> >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)")
> DataFrame[]
> >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values 
> >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as 
> >>> t2(k, v)")
> DataFrame[]
> >>> spark.sql("SELECT t1.k FROM t1 WHERE  t1.v <= (SELECT   
> >>> udf(max(udf(t2.v))) FROM     t2 WHERE    udf(t2.k) = udf(t1.k))").show()
> py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> udf(null)
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296)
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295)
>  at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52)
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes UnsupportedOperationException

2019-07-19 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28441:

Summary: PythonUDF used in correlated scalar subquery causes 
UnsupportedOperationException   (was: PythonUDF used in correlated scalar 
subquery causes )

> PythonUDF used in correlated scalar subquery causes 
> UnsupportedOperationException 
> --
>
> Key: SPARK-28441
> URL: https://issues.apache.org/jira/browse/SPARK-28441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I found this when doing https://issues.apache.org/jira/browse/SPARK-28277
>  
> {code:java}
> >>> @pandas_udf("string", PandasUDFType.SCALAR)
> ... def noop(x):
> ...     return x.apply(str)
> ... 
> >>> spark.udf.register("udf", noop)
> 
> >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values 
> >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)")
> DataFrame[]
> >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values 
> >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as 
> >>> t2(k, v)")
> DataFrame[]
> >>> spark.sql("SELECT t1.k FROM t1 WHERE  t1.v <= (SELECT   
> >>> udf(max(udf(t2.v))) FROM     t2 WHERE    udf(t2.k) = udf(t1.k))").show()
> py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> udf(null)
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296)
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295)
>  at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52)
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes UnsupportedOperationException

2019-07-19 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28441:

Priority: Major  (was: Minor)

> PythonUDF used in correlated scalar subquery causes 
> UnsupportedOperationException 
> --
>
> Key: SPARK-28441
> URL: https://issues.apache.org/jira/browse/SPARK-28441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> I found this when doing https://issues.apache.org/jira/browse/SPARK-28277
>  
> {code:java}
> >>> @pandas_udf("string", PandasUDFType.SCALAR)
> ... def noop(x):
> ...     return x.apply(str)
> ... 
> >>> spark.udf.register("udf", noop)
> 
> >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values 
> >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)")
> DataFrame[]
> >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values 
> >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as 
> >>> t2(k, v)")
> DataFrame[]
> >>> spark.sql("SELECT t1.k FROM t1 WHERE  t1.v <= (SELECT   
> >>> udf(max(udf(t2.v))) FROM     t2 WHERE    udf(t2.k) = udf(t1.k))").show()
> py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> udf(null)
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296)
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295)
>  at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52)
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-18 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887707#comment-16887707
 ] 

Liang-Chi Hsieh commented on SPARK-28288:
-

Those errors can be found in the original window.sql. Seems fine.

> Convert and port 'window.sql' into UDF test base
> 
>
> Key: SPARK-28288
> URL: https://issues.apache.org/jira/browse/SPARK-28288
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Fallback locale to en_US in StopWordsRemover if system default locale isn't in available locales in JVM

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Summary: Fallback locale to en_US in StopWordsRemover if system default 
locale isn't in available locales in JVM  (was: Set default locale param for 
StopWordsRemover to en_US if system default locale isn't in available locales 
in JVM)

> Fallback locale to en_US in StopWordsRemover if system default locale isn't 
> in available locales in JVM
> ---
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Summary: Set default locale param for StopWordsRemover to en_US if system 
default locale isn't in available locales in JVM  (was: Set default locale for 
StopWordsRemover tests to prevent invalid locale error during test)

> Set default locale param for StopWordsRemover to en_US if system default 
> locale isn't in available locales in JVM
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Priority: Major  (was: Minor)

> Set default locale param for StopWordsRemover to en_US if system default 
> locale isn't in available locales in JVM
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Component/s: (was: PySpark)

> Set default locale param for StopWordsRemover to en_US if system default 
> locale isn't in available locales in JVM
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Issue Type: Bug  (was: Test)

> Set default locale param for StopWordsRemover to en_US if system default 
> locale isn't in available locales in JVM
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Component/s: (was: Tests)
 ML

> Set default locale for StopWordsRemover tests to prevent invalid locale error 
> during test
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test

2019-07-14 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28365:

Description: 
Because the local default locale isn't among the available locales in {{Locale}}, 
when I ran some tests locally with Python code, the {{StopWordsRemover}}-related 
Python tests hit errors like:

{code}
Traceback (most recent call last):
  File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
test_stopwordsremover
stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
  File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
return func(self, **kwargs)
  File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
self.uid)
  File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
  File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1554, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
raise converted
pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
parameter locale given invalid value en_TW.'
{code}

As per [~hyukjin.kwon]'s advice, instead of setting up the locale just to pass the 
tests, it is better to fall back to a workable locale if the system default locale 
can't be found among the available locales in the JVM. Otherwise, users have to 
manually change the system locale or access a private property _jvm in PySpark.
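
For reference, a rough sketch of the {{_jvm}} workaround mentioned above (not 
recommended, since it relies on a private attribute): force the JVM default locale 
to one that is always available before constructing the transformer.

{code:python}
# Workaround sketch only: relies on the private _jvm handle.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm

# Set the JVM default locale to en_US, which is always among the available locales.
jvm.java.util.Locale.setDefault(jvm.java.util.Locale("en", "US"))

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
{code}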

  was:
Because the local default locale isn't among the available locales in {{Locale}}, 
when I ran some tests locally with Python code, the {{StopWordsRemover}}-related 
Python tests hit errors like:

{code}
Traceback (most recent call last):
  File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
test_stopwordsremover
stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
  File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
return func(self, **kwargs)
  File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
self.uid)
  File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
  File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1554, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
raise converted
pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
parameter locale given invalid value en_TW.'
{code}


> Set default locale for StopWordsRemover tests to prevent invalid locale error 
> during test
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Because the local default locale isn't in available locales at {{Locale}}, 
> when I did some tests locally with python code, {{StopWordsRemover}} related 
> python test hits some errors, like:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, 
> it is better to have a workable locale if system default locale can't be 
> found in available locales in JVM. Otherwise, users have to manually change 
> the system locale or access a private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28381) Upgraded version of Pyrolite to 4.30

2019-07-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28381:
---

 Summary: Upgraded version of Pyrolite to 4.30
 Key: SPARK-28381
 URL: https://issues.apache.org/jira/browse/SPARK-28381
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


This upgrades Pyrolite to a newer version. Most updates in the newer version 
are for dotnet. For Java, it includes a bug fix to Unpickler regarding cleaning 
up the Unpickler memo, and support for protocol 5.

 

After upgrading, we can remove the fix at SPARK-27629 for the bug in Unpickler.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28378:
---

 Summary: Remove usage of cgi.escape
 Key: SPARK-28378
 URL: https://issues.apache.org/jira/browse/SPARK-28378
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


{{cgi.escape}} is deprecated [1] and removed in 3.8 [2]. We had better replace it.

[1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
[2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]
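
For context, a minimal sketch of the replacement (the Python docs point to 
{{html.escape}} as the substitute); note that the default quote handling differs 
between the two functions:

{code:python}
# cgi.escape(s) did not escape quotes by default; html.escape(s) does.
from html import escape

print(escape('<tag> & "quoted"', quote=False))  # &lt;tag&gt; &amp; "quoted"
print(escape('<tag> & "quoted"'))               # &lt;tag&gt; &amp; &quot;quoted&quot;
{code}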



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test

2019-07-12 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28365:
---

 Summary: Set default locale for StopWordsRemover tests to prevent 
invalid locale error during test
 Key: SPARK-28365
 URL: https://issues.apache.org/jira/browse/SPARK-28365
 Project: Spark
  Issue Type: Test
  Components: PySpark, Tests
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


Because the local default locale isn't among the available locales in {{Locale}}, 
when I ran some tests locally with Python code, the {{StopWordsRemover}}-related 
Python tests hit errors like:

{code}
Traceback (most recent call last):
  File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
test_stopwordsremover
stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
  File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
return func(self, **kwargs)
  File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
self.uid)
  File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
  File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1554, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
raise converted
pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
parameter locale given invalid value en_TW.'
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28345) PythonUDF predicate should be able to pushdown to join

2019-07-11 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882712#comment-16882712
 ] 

Liang-Chi Hsieh commented on SPARK-28345:
-

I found this when doing SPARK-28276.

> PythonUDF predicate should be able to pushdown to join
> --
>
> Key: SPARK-28345
> URL: https://issues.apache.org/jira/browse/SPARK-28345
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> A Filter predicate using a PythonUDF currently can't be pushed down into a join 
> condition. A predicate like that should be able to be pushed down to the join 
> condition. For PythonUDFs that can't be evaluated in a join condition, 
> {{PullOutPythonUDFInJoinCondition}} will pull them out later.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28345) PythonUDF predicate should be able to pushdown to join

2019-07-10 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28345:
---

 Summary: PythonUDF predicate should be able to pushdown to join
 Key: SPARK-28345
 URL: https://issues.apache.org/jira/browse/SPARK-28345
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


A Filter predicate using a PythonUDF currently can't be pushed down into a join 
condition. A predicate like that should be able to be pushed down to the join 
condition. For PythonUDFs that can't be evaluated in a join condition, 
{{PullOutPythonUDFInJoinCondition}} will pull them out later.
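
A hypothetical repro sketch (made-up data and UDF, not from the ticket) of the kind 
of query this is about: a filter predicate wrapping a PythonUDF sitting on top of a 
join.

{code:python}
# Sketch only: the filter predicate below uses a Python UDF and, per this ticket,
# should be eligible for push-down into the join condition.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
right = spark.createDataFrame([(10, "a"), (20, "c")], ["id2", "y"])

norm = udf(lambda s: s.lower(), "string")

q = left.crossJoin(right).filter(norm(left["x"]) == norm(right["y"]))
q.explain()  # shows where the PythonUDF predicate ends up in the physical plan
{code}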



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28323) PythonUDF should be able to use in join condition

2019-07-09 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881717#comment-16881717
 ] 

Liang-Chi Hsieh commented on SPARK-28323:
-

I found this bug when doing SPARK-28276.

> PythonUDF should be able to use in join condition
> -
>
> Key: SPARK-28323
> URL: https://issues.apache.org/jira/browse/SPARK-28323
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> There is a bug in {{ExtractPythonUDFs}} that produces wrong result 
> attributes. It causes a failure when using PythonUDFs across multiple child 
> plans, e.g., in a join. An example is using PythonUDFs in a join condition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28323) PythonUDF should be able to use in join condition

2019-07-09 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28323:
---

 Summary: PythonUDF should be able to use in join condition
 Key: SPARK-28323
 URL: https://issues.apache.org/jira/browse/SPARK-28323
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


There is a bug in {{ExtractPythonUDFs}} that produces wrong result attributes. 
It causes a failure when using PythonUDFs across multiple child plans, e.g., in a 
join. An example is using PythonUDFs in a join condition.
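
A hypothetical example (made-up data, not from the ticket) of the failing shape: a 
PythonUDF referenced directly in the join condition, with arguments from both child 
plans.

{code:python}
# Sketch only: a Python UDF whose arguments come from both sides of the join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1,), (2,)], ["a"])
right = spark.createDataFrame([(1,), (3,)], ["b"])

eq = udf(lambda x, y: x == y, "boolean")
left.join(right, eq(left["a"], right["b"])).show()
{code}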



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base

2019-07-09 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881033#comment-16881033
 ] 

Liang-Chi Hsieh commented on SPARK-28276:
-

Will look into this.

> Convert and port 'cross-join.sql' into UDF test base
> 
>
> Key: SPARK-28276
> URL: https://issues.apache.org/jira/browse/SPARK-28276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24152) SparkR CRAN feasibility check server problem

2019-07-02 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877497#comment-16877497
 ] 

Liang-Chi Hsieh edited comment on SPARK-24152 at 7/3/19 5:19 AM:
-

Received a reply that it has been cleaned up. So I think it is fine now.


was (Author: viirya):
Received a reply that it has been cleaned up.

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we already started to merge the PR by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2019-07-02 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877497#comment-16877497
 ] 

Liang-Chi Hsieh commented on SPARK-24152:
-

Received a reply that it has been cleaned up.

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we already started to merge the PR by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2019-07-02 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877401#comment-16877401
 ] 

Liang-Chi Hsieh commented on SPARK-24152:
-

I noticed that this issue is happening again. I contacted the CRAN admin and asked 
for help. Will update when they reply.

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we already started to merge the PR by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28215) as_tibble was removed from Arrow R API

2019-06-29 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28215:
---

 Summary: as_tibble was removed from Arrow R API
 Key: SPARK-28215
 URL: https://issues.apache.org/jira/browse/SPARK-28215
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


The new R API of Arrow has removed `as_tibble`. Arrow-optimized collect in R 
doesn't work now due to the change.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-06-25 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872852#comment-16872852
 ] 

Liang-Chi Hsieh commented on SPARK-22340:
-

[~hyukjin.kwon] Should we reopen this, as you have opened a PR for it now?

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>Priority: Major
>  Labels: bulk-closed
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
> sc.setJobGroup('hello', 'hello jobs')
> x = sc.range(100).sum()
> y = sc.range(1000).sum()
> return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
> future = executor.submit(run_jobs)
> sc.cancelJobGroup('hello')
> future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

2019-06-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868577#comment-16868577
 ] 

Liang-Chi Hsieh commented on SPARK-28079:
-

{{columnNameOfCorruptRecord}} is currently applied only when users explicitly 
specify it in the user-defined schema, as documented. I think you are proposing 
to let Spark SQL add the {{columnNameOfCorruptRecord}} column automatically, 
without an explicit user-defined schema?



> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is 
> manually added to the schema
> -
>
> Key: SPARK-28079
> URL: https://issues.apache.org/jira/browse/SPARK-28079
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.3
>Reporter: F Jimenez
>Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged 
> as such and are read in. The only way to get them flagged is to manually set 
> "columnNameOfCorruptRecord" AND manually set the schema including this 
> column. Example:
> {code:java}
> // Second row has a 4th column that is not declared in the header/schema
> val csvText = s"""
>  | FieldA, FieldB, FieldC
>  | a1,b1,c1
>  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> ++--+--+--+
> |corrupt |fieldA|fieldB|fieldC|
> ++--+--+--+
> |null    | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> ++--+--+--+
> {code}
> However removing the "schema" option and going:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +---+---+---+
> | FieldA| FieldB| FieldC|
> +---+---+---+
> | a1    |b1 |c1 |
> | a2    |b2 |c2 |
> +---+---+---+
> {code}
> The fourth value "d*" in the second row has been removed and the row not 
> marked as corrupt
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27946) Hive DDL to Spark DDL conversion USING "show create table"

2019-06-19 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867771#comment-16867771
 ] 

Liang-Chi Hsieh commented on SPARK-27946:
-

[~smilegator] Thanks for pinging me. I'd like to do it, although I'm a little 
busy these days. I will try to work on it this weekend. If others are interested 
in working on this, please feel free to take it.

> Hive DDL to Spark DDL conversion USING "show create table"
> --
>
> Key: SPARK-27946
> URL: https://issues.apache.org/jira/browse/SPARK-27946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Many users migrate tables created with Hive DDL to Spark. Defining the table 
> with Spark DDL brings performance benefits. We need to add a feature to Show 
> Create Table that allows you to generate Spark DDL for a table. For example: 
> `SHOW CREATE TABLE customers AS SPARK`.
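
A sketch of how the proposed statement would be invoked from a Spark session (the 
{{AS SPARK}} syntax is only the proposal above and is not available yet):

{code}
// Hypothetical usage of the proposed syntax; "customers" is the table from the
// example above.
spark.sql("SHOW CREATE TABLE customers AS SPARK").show(truncate = false)
{code}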



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

2019-06-18 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866777#comment-16866777
 ] 

Liang-Chi Hsieh commented on SPARK-28079:
-

Isn't this the expected behavior, as documented for the {{PERMISSIVE}} mode of CSV?

> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is 
> manually added to the schema
> -
>
> Key: SPARK-28079
> URL: https://issues.apache.org/jira/browse/SPARK-28079
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.3
>Reporter: F Jimenez
>Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged 
> as such and read in. Only way to get them flagged is to manually set 
> "columnNameOfCorruptRecord" AND manually setting the schema including this 
> column. Example:
> {code:java}
> // Second row has a 4th column that is not declared in the header/schema
> val csvText = s"""
>  | FieldA, FieldB, FieldC
>  | a1,b1,c1
>  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> ++--+--+--+
> |corrupt |fieldA|fieldB|fieldC|
> ++--+--+--+
> |null    | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> ++--+--+--+
> {code}
> However removing the "schema" option and going:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +---+---+---+
> | FieldA| FieldB| FieldC|
> +---+---+---+
> | a1    |b1 |c1 |
> | a2    |b2 |c2 |
> +---+---+---+
> {code}
> The fourth value "d*" in the second row has been removed and the row not 
> marked as corrupt
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865714#comment-16865714
 ] 

Liang-Chi Hsieh edited comment on SPARK-28058 at 6/17/19 3:59 PM:
--

[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+
{code}

In this case, the reader should read two columns, but the corrupted record has 
only one. Reasonably, it should be dropped as a malformed record, yet we see the 
missing column filled with null instead.

This seems to be inherited from the Univocity parser, where we use 
{{CsvParserSettings.selectIndexes}} to do field selection. In the above case, the 
parser returns two tokens, with the second token being just null. I'm not sure 
whether this is known behavior of the Univocity parser or a bug in it.
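
A minimal standalone sketch of that parser-level behavior follows (assuming the 
univocity-parsers artifact is on the classpath; this only illustrates the 
observation above and is not a confirmed Univocity contract):

{code}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
// Select only the first two fields, roughly what Spark's CSV column pruning does.
settings.selectIndexes(Integer.valueOf(0), Integer.valueOf(1))
val parser = new CsvParser(settings)

// "xxx" is the malformed single-field line from fruit.csv in this report.
val tokens = parser.parseLine("xxx")
// Per the observation above, this yields two tokens with the second one null,
// so the record no longer looks malformed to the caller.
println(tokens.mkString(", "))
{code}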


was (Author: viirya):
[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+
{code}

In this case, the reader should read two columns, but the corrupted record has 
only one. Reasonably, it should be dropped as a malformed record, yet we see the 
missing column filled with null instead.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865714#comment-16865714
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+
{code}

In this case, the reader should read two columns, but the corrupted record has 
only one. Reasonably, it should be dropped as a malformed record, yet we see the 
missing column filled with null instead.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865695#comment-16865695
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

[~stwhit] Thanks for letting us know!

Although it is mentioned in the migration guide, for new users it would be good to 
have a note in the {{DROPMALFORMED}} doc as well. I filed SPARK-28082 to track the 
doc improvement.



> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28082:
---

 Summary: Add a note to DROPMALFORMED mode of CSV for column pruning
 Key: SPARK-28082
 URL: https://issues.apache.org/jira/browse/SPARK-28082
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


This is inspired by SPARK-28058.

When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if the 
malformed columns aren't read. This behavior is due to CSV parser column pruning. 
The current doc of {{DROPMALFORMED}} doesn't mention the effect of column pruning, 
so users may be confused when {{DROPMALFORMED}} mode doesn't work as expected.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865664#comment-16865664
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

Although this isn't a bug, I think it might be worth adding a note to the current 
doc to explain/clarify it.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865650#comment-16865650
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

This is due to CSV parser column pruning. You can disable it, and the behavior 
will be what you expect:
{code:java}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+------+
|fruit |
+------+
|apple |
|banana|
|orange|
|xxx   |
+------+

scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+------+
|fruit |
+------+
|apple |
|banana|
|orange|
+------+
{code}

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case

2019-06-16 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865006#comment-16865006
 ] 

Liang-Chi Hsieh commented on SPARK-28054:
-

I tested on Hive and the query works there. By the way, the issue is also 
reproducible on current master.

> Unable to insert partitioned table dynamically when partition name is upper 
> case
> 
>
> Key: SPARK-28054
> URL: https://issues.apache.org/jira/browse/SPARK-28054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: ChenKai
>Priority: Major
>
> {code:java}
> -- create sql and column name is upper case
> CREATE TABLE src (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING)
> -- insert sql
> INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds
> {code}
> The error is:
> {code:java}
> Error in query: 
> org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: 
> Partition spec {ds=, DS=1} contains non-partition columns;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case

2019-06-14 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864189#comment-16864189
 ] 

Liang-Chi Hsieh commented on SPARK-28054:
-

Does this query work on Hive?

> Unable to insert partitioned table dynamically when partition name is upper 
> case
> 
>
> Key: SPARK-28054
> URL: https://issues.apache.org/jira/browse/SPARK-28054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: ChenKai
>Priority: Major
>
> {code:java}
> -- create sql and column name is upper case
> CREATE TABLE src (KEY INT, VALUE STRING) PARTITIONED BY (DS STRING)
> -- insert sql
> INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds
> {code}
> The error is:
> {code:java}
> Error in query: 
> org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: 
> Partition spec {ds=, DS=1} contains non-partition columns;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28043) Reading json with duplicate columns drops the first column value

2019-06-14 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864107#comment-16864107
 ] 

Liang-Chi Hsieh commented on SPARK-28043:
-

To make duplicate JSON keys work, I thought about it and looked at our current 
implementation. One concern is: how do we know which key maps to which Spark SQL 
field?

Suppose we have two duplicate keys "a" as above. We infer the Spark SQL schema as 
"a string, a string". Does the order of keys in a JSON string imply the order of 
fields? In our current implementation, such a mapping doesn't exist, which means 
the order of keys can differ in each JSON string.

Isn't that prone to unnoticed errors when reading JSON?

Another option is to forbid duplicate JSON keys. Maybe add a legacy config to fall 
back to the current behavior, if we don't want to break existing code?
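
A small sketch of why the mapping is ambiguous (the input below is hypothetical and 
only for illustration):

{code}
// With duplicate keys there is nothing that ties a particular "a" occurrence in
// the JSON text to a particular field of the inferred "a string, a string" schema,
// and the key order can differ from record to record.
val rows = Seq(
  """{"a": "x1", "a": "x2"}""",
  """{"a": "x2", "a": "x1"}"""   // same keys, opposite order -- which one is the first field?
)
val df = spark.read.json(spark.sparkContext.parallelize(rows))
df.printSchema()                 // two fields, both named "a"

// The duplicate names also cannot be resolved by name afterwards:
// df.select("a")                // fails with an ambiguous-reference error
df.toDF("a_1", "a_2").show()     // renaming by position is one way to inspect the result
{code}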





> Reading json with duplicate columns drops the first column value
> 
>
> Key: SPARK-28043
> URL: https://issues.apache.org/jira/browse/SPARK-28043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the 
> value of the first one. JSON recommends unique names but does not require it; 
> since JSON and Spark SQL both allow duplicate field names, we should fix the 
> bug where the first column value is getting dropped.
>  
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which 
> is causing the first value to be overridden.
>  
> Repro (Python, 2.4):
> {code}
> scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", 
> \"a\": \"blah2\"} ]"))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at 
> parallelize at :23
> scala> val df = spark.read.json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [a: string, a: string]   
>   
> scala> df.show
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
> {code}
>  
> The expected response would be:
> {code}
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28043) Reading json with duplicate columns drops the first column value

2019-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863680#comment-16863680
 ] 

Liang-Chi Hsieh commented on SPARK-28043:
-

I tried to look around for that, e.g. 
https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object.

So JSON doesn't disallow duplicate keys. Spark SQL doesn't disallow duplicate 
field names either, although it can impose some difficulties when using a DataFrame 
with duplicate field names. To clarify, just because Spark SQL allows duplicate 
field names doesn't mean we should use that feature. But I think that, to some 
extent, the current behavior isn't consistent.

{code}
scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", 
\"a\": \"blah2\"} ]"))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at 
parallelize at :23
scala> val df = spark.read.json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [a: string, a: string] 
scala> df.show
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+
{code}

> Reading json with duplicate columns drops the first column value
> 
>
> Key: SPARK-28043
> URL: https://issues.apache.org/jira/browse/SPARK-28043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the 
> value of the first one. JSON recommends unique names but does not require it; 
> since JSON and Spark SQL both allow duplicate field names, we should fix the 
> bug where the first column value is getting dropped.
>  
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which 
> is causing the first value to be overridden.
>  
> Repro (Python, 2.4):
> >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": 
> >>> \"blah2\"}"])
>  >>> df = spark.read.json(jsonRDD)
>  >>> df.show()
>  +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
>  
> The expected response would be:
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863180#comment-16863180
 ] 

Liang-Chi Hsieh commented on SPARK-28006:
-

I'm curious about two questions:

Can we use pandas aggregate udfs as window functions?

Because the proposed GROUPED_XFORM udf calculates output values for all rows in 
the group, it looks like it can only use the window frame 
(UnboundedPreceding, UnboundedFollowing)?
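
For comparison, a sketch of the same computation with built-in functions over a 
window is below. An aggregate over {{Window.partitionBy("id")}} without an ordering 
uses the default unbounded frame, i.e. the (UnboundedPreceding, UnboundedFollowing) 
frame asked about above. This is only an illustration, not part of the proposal.

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

val df = spark.createDataFrame(
  Seq((1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0))
).toDF("id", "v")

// Subtract the per-group mean, equivalent to the proposed subtract_mean udf.
val w = Window.partitionBy("id")
df.withColumn("v", col("v") - avg("v").over(w)).show()
{code}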

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> unbounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf)
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downside:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863030#comment-16863030
 ] 

Liang-Chi Hsieh commented on SPARK-27966:
-

I can't see where input_file_name is used from the truncated output. If it is just 
in the Project, I can't tell why it doesn't work. Without a good reproducer, I 
agree with [~hyukjin.kwon] that we may resolve this JIRA.
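
For reference, the workaround mentioned in the report can be sketched as follows. 
The threshold value and path are illustrative only (the reporter's actual value was 
elided); raising the threshold keeps file listing on the driver instead of running 
a parallel listing job.

{code}
import org.apache.spark.sql.functions.input_file_name

spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "9999")
val df = spark.read.parquet("/path/to/data")   // illustrative path
df.select(input_file_name()).show(5, false)
{code}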



> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar and probably related to SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've setup a larger cluster for which after parallel file listing the 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


