[jira] [Commented] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8

2018-08-12 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577846#comment-16577846
 ] 

Saisai Shao commented on SPARK-25020:
-

Yes, we also hit the same issue. This is more of a Hadoop problem than a Spark 
problem. Would you please also report this issue to the Hadoop community?

CC [~ste...@apache.org], I think I discussed the same issue with you before; I 
believe this is something that should be fixed on the Hadoop side.

> Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
> --
>
> Key: SPARK-25020
> URL: https://issues.apache.org/jira/browse/SPARK-25020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: Spark Streaming
> -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful 
> shutdown) 
> Hadoop 2.8
>Reporter: Ricky Saltzer
>Priority: Major
>
> Opening this up to give you guys some insight into an issue that will occur 
> when using Spark Streaming with Hadoop 2.8. 
> Hadoop 2.8 added HADOOP-12950, which imposes an upper limit of 10 seconds on 
> its shutdown hook. From our tests, if the Spark job takes longer than 10 
> seconds to shut down gracefully, the Hadoop shutdown thread seems to trample 
> over the graceful shutdown and throw an exception like
> {code:java}
> 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
> at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681)
> at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599)
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> The reason I hit this issue is that we recently upgraded to EMR 5.15, 
> which has both Spark 2.3 & Hadoop 2.8. The following workaround has proven 
> successful for us (in limited testing).
> Instead of just running
> {code:java}
> ...
> ssc.start()
> ssc.awaitTermination(){code}
> We needed to do the following
> {code:java}
> ...
> ssc.start()
> sys.ShutdownHookThread {
>   ssc.stop(true, true)
> }
> ssc.awaitTermination(){code}
> As far as I can tell, there is no way to override the default {{10 second}} 
> timeout in HADOOP-12950, which is why we had to go with the workaround. 
> Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 
> 2.2.x & Hadoop 2.8. 
> Ricky
>  Epic Games
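For reference, here is a self-contained sketch of the workaround described above. The application name, batch interval, and socket source are placeholders (not part of the original report); the relevant piece is registering an explicit JVM shutdown hook that performs a graceful stop.

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-shutdown-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder pipeline; any streaming job shape works here.
    ssc.socketTextStream("localhost", 9999)
      .count()
      .print()

    ssc.start()

    // Per the report, registering our own shutdown hook that stops the context
    // gracefully avoided the InterruptedException thrown when Hadoop 2.8's
    // 10-second shutdown-hook limit cuts off Spark's stop-on-shutdown path.
    sys.ShutdownHookThread {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    ssc.awaitTermination()
  }
}
{code}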



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (SPARK-25096) Loosen nullability if the cast is force-nullable.

2018-08-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25096:


Assignee: Apache Spark

> Loosen nullability if the cast is force-nullable.
> -
>
> Key: SPARK-25096
> URL: https://issues.apache.org/jira/browse/SPARK-25096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> In type coercion for complex types, if the cast to the found common type is 
> force-nullable, we should loosen the nullability so that the cast is allowed. 
> Also, such a type can't be used as a map key type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25096) Loosen nullability if the cast is force-nullable.

2018-08-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25096:


Assignee: (was: Apache Spark)

> Loosen nullability if the cast is force-nullable.
> -
>
> Key: SPARK-25096
> URL: https://issues.apache.org/jira/browse/SPARK-25096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> In type coercion for complex types, if the cast to the found common type is 
> force-nullable, we should loosen the nullability so that the cast is allowed. 
> Also, such a type can't be used as a map key type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25096) Loosen nullability if the cast is force-nullable.

2018-08-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577793#comment-16577793
 ] 

Apache Spark commented on SPARK-25096:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22086

> Loosen nullability if the cast is force-nullable.
> -
>
> Key: SPARK-25096
> URL: https://issues.apache.org/jira/browse/SPARK-25096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> In type coercion for complex types, if the cast to the found common type is 
> force-nullable, we should loosen the nullability so that the cast is allowed. 
> Also, such a type can't be used as a map key type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25096) Loosen nullability if the cast is force-nullable.

2018-08-12 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-25096:
-

 Summary: Loosen nullability if the cast is force-nullable.
 Key: SPARK-25096
 URL: https://issues.apache.org/jira/browse/SPARK-25096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takuya Ueshin


In type coercion for complex types, if the cast to the found common type is 
force-nullable, we should loosen the nullability so that the cast is allowed. 
Also, such a type can't be used as a map key type.
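As an aside, a tiny spark-shell illustration (not taken from the issue) of what "force-nullable" means here: some casts can produce null even from non-null input, so the cast's target type has to be nullable.

{code:java}
// Casting a string to int is force-nullable: a non-numeric string yields null,
// so the result type must be nullable even if the input column was not.
spark.sql("SELECT CAST('abc' AS INT) AS i").show()
// prints a single row whose value is null
{code}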



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-12 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577763#comment-16577763
 ] 

Hyukjin Kwon commented on SPARK-25094:
--

[~igreenfi], mind adding a reproducer so that we can verify and resolve this 
later?

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25086) Incorrect Default Value For "escape" For CSV Files

2018-08-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25086:
-
Component/s: (was: Spark Core)
 SQL

> Incorrect Default Value For "escape" For CSV Files
> --
>
> Key: SPARK-25086
> URL: https://issues.apache.org/jira/browse/SPARK-25086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Wilcox
>Priority: Major
>
> The RFC for CSV files ([https://tools.ietf.org/html/rfc4180]) indicates that 
> the way that a double-quote is escaped is by preceding it with another 
> double-quote:
> {code:java}
> 7. If double-quotes are used to enclose fields, then a double-quote appearing 
> inside a field must be escaped by preceding it with another double quote. For 
> example: "aaa","b""bb","ccc"{code}
> Your default value for "escape" violates the RFC. I think that we should fix 
> the default value to be {{"}}, and those that want {{\}} as the escape 
> character can override it for non-RFC-conforming CSV files.
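In the meantime, a minimal Scala sketch (the path and header option are hypothetical, not from the report) of opting into RFC 4180 behaviour explicitly by setting the escape character to a double quote:

{code:java}
// Spark's CSV reader defaults to quote = " and escape = \ .
// Overriding escape to " makes doubled quotes inside quoted fields ("b""bb")
// parse as a literal double quote, per RFC 4180.
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/rfc4180.csv")   // hypothetical path

df.show(truncate = false)
{code}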



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25086) Incorrect Default Value For "escape" For CSV Files

2018-08-12 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577761#comment-16577761
 ] 

Hyukjin Kwon commented on SPARK-25086:
--

I think this is a duplicate of SPARK-22236

> Incorrect Default Value For "escape" For CSV Files
> --
>
> Key: SPARK-25086
> URL: https://issues.apache.org/jira/browse/SPARK-25086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Wilcox
>Priority: Major
>
> The RFC for CSV files ([https://tools.ietf.org/html/rfc4180]) indicates that 
> the way that a double-quote is escaped is by preceding it with another 
> double-quote:
> {code:java}
> 7. If double-quotes are used to enclose fields, then a double-quote appearing 
> inside a field must be escaped by preceding it with another double quote. For 
> example: "aaa","b""bb","ccc"{code}
> Your default value for "escape" violates the RFC. I think that we should fix 
> the default value to be {{"}}, and those that want {{\}} as the escape 
> character can override it for non-RFC-conforming CSV files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25033) Bump Apache commons.{httpclient, httpcore}

2018-08-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25033:


Assignee: Fokko Driesprong

> Bump Apache commons.{httpclient, httpcore}
> --
>
> Key: SPARK-25033
> URL: https://issues.apache.org/jira/browse/SPARK-25033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.0
>
>
> I would like to bump the versions to bring them up to date with my other 
> dependencies, in my case Stocator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25033) Bump Apache commons.{httpclient, httpcore}

2018-08-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25033.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22007
[https://github.com/apache/spark/pull/22007]

> Bump Apache commons.{httpclient, httpcore}
> --
>
> Key: SPARK-25033
> URL: https://issues.apache.org/jira/browse/SPARK-25033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.0
>
>
> I would like to bump the versions to bring them up to date with my other 
> dependencies, in my case Stocator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25090) java.lang.ClassCastException when using a CrossValidator

2018-08-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25090:


Assignee: Marco Gaido

> java.lang.ClassCastException when using a CrossValidator
> 
>
> Key: SPARK-25090
> URL: https://issues.apache.org/jira/browse/SPARK-25090
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
> Environment: Windows 10 64-bits, pyspark 2.3.1 on Anaconda.
>Reporter: Mark Morrisson
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> When I fit a LogisticRegression on a dataset, everything works fine but when 
> I fit a CrossValidator, I get this error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o1187.w.
> : java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Double
>  at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
>  at org.apache.spark.ml.param.DoubleParam.w(params.scala:330)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.lang.Thread.run(Unknown Source)
> Casting the target variable into double didn't solve the issue.
> Here is the snippet:
> {code:python}
> lr = LogisticRegression(maxIter=10, labelCol="class", featuresCol="features",
>                         rawPredictionCol="score")
> evaluator = BinaryClassificationEvaluator(rawPredictionCol="score",
>                                           labelCol="class", metricName="areaUnderROC")
> paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.05, 0.1, 0.5, 1]).build()
> crossval = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
>                           evaluator=evaluator, numFolds=3)
> bestModel = crossval.fit(train)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25090) java.lang.ClassCastException when using a CrossValidator

2018-08-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25090.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22076
[https://github.com/apache/spark/pull/22076]

> java.lang.ClassCastException when using a CrossValidator
> 
>
> Key: SPARK-25090
> URL: https://issues.apache.org/jira/browse/SPARK-25090
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
> Environment: Windows 10 64-bits, pyspark 2.3.1 on Anaconda.
>Reporter: Mark Morrisson
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> When I fit a LogisticRegression on a dataset, everything works fine but when 
> I fit a CrossValidator, I get this error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o1187.w.
> : java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Double
>  at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
>  at org.apache.spark.ml.param.DoubleParam.w(params.scala:330)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.lang.Thread.run(Unknown Source)
> Casting the target variable into double didn't solve the issue.
> Here is the snippet:
> {code:python}
> lr = LogisticRegression(maxIter=10, labelCol="class", featuresCol="features",
>                         rawPredictionCol="score")
> evaluator = BinaryClassificationEvaluator(rawPredictionCol="score",
>                                           labelCol="class", metricName="areaUnderROC")
> paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.05, 0.1, 0.5, 1]).build()
> crossval = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
>                           evaluator=evaluator, numFolds=3)
> bestModel = crossval.fit(train)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25069) Using UnsafeAlignedOffset to make the entire record of 8 byte Items aligned like which is used in UnsafeExternalSorter

2018-08-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25069:
---

Assignee: eaton

> Using UnsafeAlignedOffset to make the entire record of 8 byte Items aligned 
> like which is used in UnsafeExternalSorter 
> ---
>
> Key: SPARK-25069
> URL: https://issues.apache.org/jira/browse/SPARK-25069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.1
>Reporter: eaton
>Assignee: eaton
>Priority: Major
> Fix For: 2.4.0
>
>
> The UnsafeExternalSorter class uses UnsafeAlignedOffset to keep the entire 
> record of 8-byte items aligned, but ShuffleExternalSorter does not.
> The SPARC platform requires this because using a 4-byte int for record 
> lengths causes the entire record of 8-byte items to become misaligned by 4 
> bytes. Using an 8-byte long for the record length keeps things 8-byte aligned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25069) Using UnsafeAlignedOffset to make the entire record of 8 byte Items aligned like which is used in UnsafeExternalSorter

2018-08-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25069.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22053
[https://github.com/apache/spark/pull/22053]

> Using UnsafeAlignedOffset to make the entire record of 8 byte Items aligned 
> like which is used in UnsafeExternalSorter 
> ---
>
> Key: SPARK-25069
> URL: https://issues.apache.org/jira/browse/SPARK-25069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.1
>Reporter: eaton
>Priority: Major
> Fix For: 2.4.0
>
>
> The UnsafeExternalSorter class uses UnsafeAlignedOffset to keep the entire 
> record of 8-byte items aligned, but ShuffleExternalSorter does not.
> The SPARC platform requires this because using a 4-byte int for record 
> lengths causes the entire record of 8-byte items to become misaligned by 4 
> bytes. Using an 8-byte long for the record length keeps things 8-byte aligned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25084:

Fix Version/s: 2.3.2

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Assignee: yucai
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code}
> Exception:
> {code:java}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 131, Column 67: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 131, Column 67: One of ', )' expected instead of '['
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code}
> Wrong Codegen:
> {code:java}
> /* 131 */ private int computeHashForStruct_1(InternalRow 
> mutableStateArray[0], int value1) {
> /* 132 */
> /* 133 */
> /* 134 */ if (!mutableStateArray[0].isNullAt(5)) {
> /* 135 */
> /* 136 */ final int element5 = mutableStateArray[0].getInt(5);
> /* 137 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1);
> /* 138 */
> /* 139 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24774) support reading AVRO logical types - Decimal

2018-08-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24774.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22037
[https://github.com/apache/spark/pull/22037]

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24774
> URL: https://issues.apache.org/jira/browse/SPARK-24774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24774) support reading AVRO logical types - Decimal

2018-08-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24774:
---

Assignee: Gengliang Wang

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24774
> URL: https://issues.apache.org/jira/browse/SPARK-24774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-12 Thread Mateusz Jukiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577741#comment-16577741
 ] 

Mateusz Jukiewicz commented on SPARK-23298:
---

[~tgraves]

I edited the issue description and added a reproducible example which you can 
try out. Please keep in mind it might take several Spark sessions of "distinct 
counting" to observe it.

On the other hand, I tried and cannot reproduce the issue on Spark 2.3.1, so 
the aforementioned SPARK-23207 could have fixed this one as well. Ideally you 
would reproduce it on Spark 2.2 and confirm that SPARK-23207 fixes this one 
too, but if you guys don't have time for that, I'm fine with closing this one 
right away.

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens (EDIT - managed to get a reproducible example):
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> // The spark.sql.shuffle.partitions is 2154 here, if that matters
> */
> val df = spark.range(1000).withColumn("col1", (rand() * 
> 1000).cast("long")).withColumn("col2", (rand() * 
> 1000).cast("long")).drop("id")
> df.repartition(5240).write.parquet("/test.parquet")
> // Then, ideally in a new session
> val df = spark.read.parquet("/test.parquet")
> df.distinct.count
> // res1: Long = 1001256
> df.distinct.count
> // res2: Long = 55   {code}
> -The _text_dataset.out_ file is a dataset with one string per line. The 
> string has alphanumeric characters as well as colons and spaces. The line 
> length does not exceed 1200. I don't think that's important though, as the 
> issue appeared on various other datasets, I just tried to narrow it down to 
> the simplest possible case.- (the case is now fully reproducible with the 
> above code)
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably not related to reading from HDFS. Spark was always 
> able to correctly read all input records (which was shown in the UI), and 
> that number got malformed after the exchange phase:
>  ** correct execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24014227 _(second stage)_
>  ** incorrect execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 

[jira] [Updated] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-12 Thread Mateusz Jukiewicz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mateusz Jukiewicz updated SPARK-23298:
--
Description: 
This is what happens (EDIT - managed to get a reproducible example):
{code:java}
/* Exemplary spark-shell starting command 
/opt/spark/bin/spark-shell \
--num-executors 269 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=512m 

// The spark.sql.shuffle.partitions is 2154 here, if that matters
*/

val df = spark.range(1000).withColumn("col1", (rand() * 
1000).cast("long")).withColumn("col2", (rand() * 1000).cast("long")).drop("id")
df.repartition(5240).write.parquet("/test.parquet")

// Then, ideally in a new session
val df = spark.read.parquet("/test.parquet")
df.distinct.count
// res1: Long = 1001256
df.distinct.count
// res2: Long = 55   {code}
-The _text_dataset.out_ file is a dataset with one string per line. The string 
has alphanumeric characters as well as colons and spaces. The line length does 
not exceed 1200. I don't think that's important though, as the issue appeared 
on various other datasets, I just tried to narrow it down to the simplest 
possible case.- (the case is now fully reproducible with the above code)

The observations regarding the issue are as follows:
 * I managed to reproduce it on both spark 2.2 and spark 2.1.
 * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
 * The issue is not reproducible on a single machine (e.g. laptop) in spark 
local mode.
 * It seems that once the correct count is computed, it is not possible to 
reproduce the issue in the same spark session. In other words, I was able to 
get 2-3 incorrect distinct.count results consecutively, but once it got right, 
it always returned the correct value. I had to re-run spark-shell to observe 
the problem again.
 * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
read.textFile).
 * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
 * Not a single container has failed in those multiple invalid executions.
 * YARN doesn't show any warnings or errors in those invalid executions.
 * The execution plan determined for both valid and invalid executions was 
always the same (it's shown in the _SQL_ tab of the UI).
 * The number returned in the invalid executions was always greater than the 
correct number (24 014 227).
 * This occurs even though the input is already completely deduplicated (i.e. 
_distinct.count_ shouldn't change anything).
 * The input isn't replicated (i.e. there's only one copy of each file block on 
the HDFS).
 * The problem is probably not related to reading from HDFS. Spark was always 
able to correctly read all input records (which was shown in the UI), and that 
number got malformed after the exchange phase:
 ** correct execution:
 Input Size / Records: 3.9 GB / 24014227 _(first stage)_
 Shuffle Write: 3.3 GB / 24014227 _(first stage)_
 Shuffle Read: 3.3 GB / 24014227 _(second stage)_
 ** incorrect execution:
 Input Size / Records: 3.9 GB / 24014227 _(first stage)_
 Shuffle Write: 3.3 GB / 24014227 _(first stage)_
 Shuffle Read: 3.3 GB / 24020150 _(second stage)_
 * The problem might be related to the way Encoders hash data internally. The 
reason might be:
 ** in a simple `distinct.count` invocation, there are in total three 
hash-related stages (called `HashAggregate`),
 ** excerpt from scaladoc for `distinct` method says:
{code:java}
   * @note Equality checking is performed directly on the encoded 
representation of the data
   * and thus is not affected by a custom `equals` function defined on 
`T`.{code}

 * One of my suspicions was the number of partitions we're using (2154). This 
is greater than 2000, which means that a different data structure (i.e. 
_HighlyCompressedMapStatus_ instead of _CompressedMapStatus_) will be used for 
book-keeping during the shuffle. Unfortunately, after decreasing the number 
below this threshold, the problem still occurs.
 * It's easier to reproduce the issue with a large number of partitions.
 * Another of my suspicions was that it's somehow related to the number of 
blocks on HDFS (974). I was able to reproduce the problem with both fewer 
and more partitions than this value, so I think this is not the case.
 * Final note: It looks like for some reason the data gets duplicated in the 
process of data exchange during the shuffle (because shuffle read sees more 
elements than shuffle write has written).

Please let me know if you have any other questions.

I couldn't find much about similar problems on the Web; the only thing I found 
was a Spark mailing list thread where someone using PySpark found that one of 
their executors was hashing things differently than the others, which caused a 
similar issue.

I didn't include a 

[jira] [Commented] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-12 Thread Mateusz Jukiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577734#comment-16577734
 ] 

Mateusz Jukiewicz commented on SPARK-23298:
---

[~tgraves]

I have not yet tried to reproduce on Spark 2.3, but I will and will let you 
know if it helps.

I'm not doing any explicit repartitions, but the `distinct` operation does 
trigger a shuffle and will send the data over the wire. Additionally, if the 
number of partitions in the source dataframe is different from 
`spark.sql.shuffle.partitions`, then the number of partitions will change as 
well. Not sure if any of these can be called "repartitioning", but the data is 
exchanged between the nodes.
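To make the shuffle point concrete, a small spark-shell sketch (assuming no adaptive execution and reusing the /test.parquet path from the reproducer in the issue description):

{code:java}
val df = spark.read.parquet("/test.parquet")   // written with 5240 partitions above

df.rdd.getNumPartitions
// roughly the number of input file splits

df.distinct.rdd.getNumPartitions
// equals spark.sql.shuffle.partitions (2154 in the report), because distinct
// inserts a hash exchange: the data is shuffled across the network either way
{code}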

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens:
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> */
> val dataset = spark.read.textFile("/text_dataset.out")
> dataset.distinct.count
> // res0: Long = 24025868
> dataset.distinct.count
> // res1: Long = 24014227{code}
> The _text_dataset.out_ file is a dataset with one string per line. The string 
> has alphanumeric characters as well as colons and spaces. The line length 
> does not exceed 1200. I don't think that's important though, as the issue 
> appeared on various other datasets, I just tried to narrow it down to the 
> simplest possible case.
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably not related to reading from HDFS. Spark was always 
> able to correctly read all input records (which was shown in the UI), and 
> that number got malformed after the exchange phase:
>  ** correct execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24014227 _(second stage)_
>  ** incorrect execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24020150 _(second stage)_
>  * The problem might be related to the way Encoders hash data internally. 
> The reason might be:
>  ** in a simple `distinct.count` invocation, there are in total three 
> hash-related stages (called `HashAggregate`),
>  ** excerpt from scaladoc for `distinct` method says:
> {code:java}
>* @note Equality checking is performed directly on the encoded 
> representation of the data
>* and thus is not affected by a custom `equals` function defined on 
> `T`.{code}
>  * One 

[jira] [Assigned] (SPARK-25026) Binary releases should contain some copy of compiled external integration modules

2018-08-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25026:


Assignee: Apache Spark  (was: Sean Owen)

> Binary releases should contain some copy of compiled external integration 
> modules
> -
>
> Key: SPARK-25026
> URL: https://issues.apache.org/jira/browse/SPARK-25026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25026) Binary releases should contain some copy of compiled external integration modules

2018-08-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577685#comment-16577685
 ] 

Apache Spark commented on SPARK-25026:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22084

> Binary releases should contain some copy of compiled external integration 
> modules
> -
>
> Key: SPARK-25026
> URL: https://issues.apache.org/jira/browse/SPARK-25026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25026) Binary releases should contain some copy of compiled external integration modules

2018-08-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25026:


Assignee: Sean Owen  (was: Apache Spark)

> Binary releases should contain some copy of compiled external integration 
> modules
> -
>
> Key: SPARK-25026
> URL: https://issues.apache.org/jira/browse/SPARK-25026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25025) Remove the default value of isAll in INTERSECT/EXCEPT

2018-08-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25025:
--
Comment: was deleted

(was: User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22084)

> Remove the default value of isAll in INTERSECT/EXCEPT
> -
>
> Key: SPARK-25025
> URL: https://issues.apache.org/jira/browse/SPARK-25025
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Having the default value of isAll in the logical plan nodes INTERSECT/EXCEPT 
> could introduce bugs when the callers are not aware of it. Let us get rid of 
> the default value.
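To illustrate why an implicit default is easy to get wrong, a hedged spark-shell sketch (not taken from the issue) using the Dataset operators that expose the two semantics in 2.4:

{code:java}
import spark.implicits._

val a = Seq(1, 1, 2).toDF("x")
val b = Seq(1, 1, 3).toDF("x")

a.intersect(b).show()      // isAll = false: set semantics, a single row with 1
a.intersectAll(b).show()   // isAll = true: bag semantics (Spark 2.4+), 1 appears twice
{code}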



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25095) Python support for BarrierTaskContext

2018-08-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25095:


Assignee: Apache Spark

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25095) Python support for BarrierTaskContext

2018-08-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25095:


Assignee: (was: Apache Spark)

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25095) Python support for BarrierTaskContext

2018-08-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577639#comment-16577639
 ] 

Apache Spark commented on SPARK-25095:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/22085

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25025) Remove the default value of isAll in INTERSECT/EXCEPT

2018-08-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577623#comment-16577623
 ] 

Apache Spark commented on SPARK-25025:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22084

> Remove the default value of isAll in INTERSECT/EXCEPT
> -
>
> Key: SPARK-25025
> URL: https://issues.apache.org/jira/browse/SPARK-25025
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Having the default value of isAll in the logical plan nodes INTERSECT/EXCEPT 
> could introduce bugs when the callers are not aware of it. Let us get rid of 
> the default value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-12 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577593#comment-16577593
 ] 

Marco Gaido commented on SPARK-25094:
-

This is a duplicate of many other issues. Unfortunately this problem has not 
yet been fully solved, so in this case whole-stage code generation is disabled 
for the query. There is an ongoing effort to fix this issue in the future, 
though.
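In other words, the query still runs; it just falls back to the non-codegen path. A short sketch of the related configuration, with the assumed 2.3/2.4 defaults noted in comments, for anyone who wants to inspect or disable whole-stage codegen for such plans:

{code:java}
// When generated code fails to compile (e.g. a method grows past the JVM's
// 64 KB limit), Spark logs the warning above and runs the plan without
// whole-stage code generation instead of failing the query.
spark.conf.get("spark.sql.codegen.fallback")         // "true"  (assumed default)
spark.conf.get("spark.sql.codegen.hugeMethodLimit")  // "65535" (assumed default)

// Whole-stage codegen can also be disabled explicitly for a problematic query:
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}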

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25095) Python support for BarrierTaskContext

2018-08-12 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-25095:


 Summary: Python support for BarrierTaskContext
 Key: SPARK-25095
 URL: https://issues.apache.org/jira/browse/SPARK-25095
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Enable calling `BarrierTaskContext.barrier()` from the Python side.
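For context, a minimal sketch of the existing Scala-side barrier API that the Python support would mirror (assuming Spark 2.4's barrier execution mode; the RDD contents and partition count are placeholders):

{code:java}
import org.apache.spark.BarrierTaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// Run the stage in barrier mode: all tasks are launched together, and barrier()
// blocks until every task in the stage has reached the same point.
val result = rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()
  context.barrier()
  iter
}.collect()
{code}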



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-08-12 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577502#comment-16577502
 ] 

Yuming Wang edited comment on SPARK-24631 at 8/12/18 11:18 AM:
---

I hit this issue and fixed it by recreating the table/view. Can you execute 
the 2 SQLs below:

 
{code:sql}
desc testtable;
{code}
{code:sql}
show create table testtable;
{code}
 


was (Author: q79969786):
I hit this issue. fixed it by recreate table/view. Can you execute 2 SQLs like 
below:

 
{code:java}
desc testtable;
{code}
{code:java}
show create table testtable;
{code}
 

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but 
> I am still getting the error.
> I am getting this error only on the production cluster and only for a single 
> table; other tables are running fine.
> + more data,
> val df=spark.sql("select* from table_name")
> I am just trying to query a table. But with other tables it is running fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has the bigint datatype, but there were other tables 
> with bigint columns that ran fine.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-08-12 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577502#comment-16577502
 ] 

Yuming Wang commented on SPARK-24631:
-

I hit this issue and fixed it by recreating the table/view. Can you run the two SQL 
statements below:

 
{code:java}
desc testtable;
{code}
{code:java}
show create table testtable;
{code}
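
For reference, a hedged sketch in Scala of this diagnosis plus the recreate-the-view fix; the names {{testtable}} and {{base}} and the Hive-enabled session {{spark}} are assumptions, not from the ticket:

{code:scala}
// Check what column type the catalog has recorded for the table/view.
spark.sql("DESC testtable").show(false)
spark.sql("SHOW CREATE TABLE testtable").show(false)

// If the stored definition still records the old, narrower type (e.g. SMALLINT)
// while the underlying data is now BIGINT, recreating the view picks up the
// current type and the "Cannot up cast" error should go away.
spark.sql("CREATE OR REPLACE VIEW testtable AS SELECT name, id FROM base")
{code}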
 

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as
> it may truncate.{color}
> I am not doing any transformations; I am just trying to query a table, but
> I still get the error.
> I get this error only on the production cluster and only for a single
> table; other tables run fine.
> + more data:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, but with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred:
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has a BigInt data type, but there were other tables
> with BigInt columns that ran fine.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-12 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577401#comment-16577401
 ] 

Yuming Wang edited comment on SPARK-25051 at 8/12/18 10:51 AM:
---

Can you verify it with Spark [2.3.2-rc4 
|https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc4-bin/]?


was (Author: q79969786):
Can you it with Spark [2.3.2-rc4 
|https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc4-bin/]?

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas:*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in Spark 2.2.2.
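
Note that the "rows of df1 that have no match in df2" pattern can also be written as a left_anti join, which avoids referencing df2("id") on the joined result entirely (a workaround sketch, not a fix for the regression):

{code:scala}
// Workaround sketch: left_anti keeps only the df1 rows whose id does not appear in df2,
// so there is no need for df2("id").isNull after the join.
val unmatched = df1.join(df2, Seq("id"), "left_anti")
{code}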



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2018-08-12 Thread Minh Thai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577449#comment-16577449
 ] 

Minh Thai edited comment on SPARK-17368 at 8/12/18 8:22 AM:


[~jodersky] I know that this is an old ticket, but I still want to give some comments on making encoders for value classes. Even today, there is no way to have a type constraint that targets value classes. However, I think we can make a [universal trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called {{OpaqueValue}}^1^ to be used as an upper type bound in the encoder. This means:
 - Any user-defined value class has to mix in {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code:java}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
Tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash with the existing encoder for case classes:
{code:java}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
If this is possible to implement, I think it can also solve SPARK-20384.

_(1) the name is inspired by the [Opaque Types|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_


was (Author: mthai):
[~jodersky] I know that this is an old ticket but I still want to give some 
comments on making encoder for value classes. Even until today, there is no way 
to have a type constraint that targets value classes. However, I think we can 
make a [universal 
trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called 
{{OpaqueValue}}^1^ to be used as an upper type bound in encoder. This means:
 - Any user-defined value class has to mixin {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code:java}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash 
with the existing encoder for case class
{code:java}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
_If this is possible to implement. I think it can solve SPARK-20384 also._

_(1) the name is inspired from [Opaque 
Type|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>Assignee: Jakob Odersky
>Priority: Major
> Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2018-08-12 Thread Minh Thai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577449#comment-16577449
 ] 

Minh Thai edited comment on SPARK-17368 at 8/12/18 8:21 AM:


[~jodersky] I know that this is an old ticket, but I still want to give some comments on making encoders for value classes. Even today, there is no way to have a type constraint that targets value classes. However, I think we can make a [universal trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called {{OpaqueValue}}^1^ to be used as an upper type bound in the encoder. This means:
 - Any user-defined value class has to mix in {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code:java}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
Tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash with the existing encoder for case classes:
{code:java}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
 

_If this is possible to implement, I think it can also solve SPARK-20384._

_(1) the name is inspired by the [Opaque Types|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_


was (Author: mthai):
[~jodersky] I know that this is an old ticket but I still want to give some 
comments on making encoder for value classes. Even until today, there is no way 
to have a type constraint that targets value classes. However, I think we can 
make a [universal 
trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called 
{{OpaqueValue}}^1^ to be used as an upper type bound in encoder. This means:
 - Any user-defined value class has to mixin {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash 
with the existing encoder for case class
{code}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
_(1) the name is inspired from [Opaque 
Type|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>Assignee: Jakob Odersky
>Priority: Major
> Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2018-08-12 Thread Minh Thai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577449#comment-16577449
 ] 

Minh Thai edited comment on SPARK-17368 at 8/12/18 8:21 AM:


[~jodersky] I know that this is an old ticket, but I still want to give some comments on making encoders for value classes. Even today, there is no way to have a type constraint that targets value classes. However, I think we can make a [universal trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called {{OpaqueValue}}^1^ to be used as an upper type bound in the encoder. This means:
 - Any user-defined value class has to mix in {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code:java}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
Tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash with the existing encoder for case classes:
{code:java}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
_If this is possible to implement, I think it can also solve SPARK-20384._

_(1) the name is inspired by the [Opaque Types|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_


was (Author: mthai):
[~jodersky] I know that this is an old ticket but I still want to give some 
comments on making encoder for value classes. Even until today, there is no way 
to have a type constraint that targets value classes. However, I think we can 
make a [universal 
trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called 
{{OpaqueValue}}^1^ to be used as an upper type bound in encoder. This means:
 - Any user-defined value class has to mixin {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code:java}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash 
with the existing encoder for case class
{code:java}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
 

_If this is possible to implement. I think it can solve SPARK-20384 also._

_(1) the name is inspired from [Opaque 
Type|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>Assignee: Jakob Odersky
>Priority: Major
> Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2018-08-12 Thread Minh Thai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577449#comment-16577449
 ] 

Minh Thai commented on SPARK-17368:
---

[~jodersky] I know that this is an old ticket, but I still want to give some comments on making encoders for value classes. Even today, there is no way to have a type constraint that targets value classes. However, I think we can make a [universal trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called {{OpaqueValue}}^1^ to be used as an upper type bound in the encoder. This means:
 - Any user-defined value class has to mix in {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
Tested on my machine using Spark 2.1.0 and Scala 2.11.12, this doesn't clash with the existing encoder for case classes:
{code}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
_(1) the name is inspired by the [Opaque Types|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_
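
To make the type-level idea above concrete, here is a hedged, pure-Scala sketch (no Spark required; all names are made up) showing that only value classes which opt in via {{OpaqueValue}} satisfy the proposed {{Product with OpaqueValue}} bound:

{code:scala}
// Sketch of the proposed upper bound, independent of Spark's encoder machinery.
object OpaqueValueBoundSketch {
  trait OpaqueValue extends Any
  case class Id(value: Int) extends AnyVal with OpaqueValue   // value class that opts in
  case class Plain(value: Int)                                 // ordinary case class

  // stands in for an encoder that should only accept opted-in value classes
  def needsValueClass[T <: Product with OpaqueValue](t: T): T = t

  def main(args: Array[String]): Unit = {
    println(needsValueClass(Id(1)))        // compiles: Id meets the bound
    // println(needsValueClass(Plain(1)))  // would not compile: Plain lacks OpaqueValue
  }
}
{code}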

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>Assignee: Jakob Odersky
>Priority: Major
> Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25094) processNext() failed to compile: size is over 64 KB

2018-08-12 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-25094:
---

 Summary: processNext() failed to compile: size is over 64 KB
 Key: SPARK-25094
 URL: https://issues.apache.org/jira/browse/SPARK-25094
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Izek Greenfield


I have this tree:

2018-08-12T07:14:31,289 WARN  [] 
org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
disabled for plan (id=1):
 *(1) Project [, ... 10 more fields]
+- *(1) Filter NOT exposure_calc_method#10141 IN 
(UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
   +- InMemoryTableScan [, ... 11 more fields], [NOT exposure_calc_method#10141 
IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
 +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
deserialized, 1 replicas)
   +- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
  :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
  :  +- Exchange(coordinator id: 1456511137) 
UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
67108864]
  : +- *(1) Project [, ... 6 more fields]
  :+- *(1) Filter (isnotnull(v#49) && 
isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
&& (v#49 = DATA_REG)) && isnotnull(unique_id#39))
  :   +- InMemoryTableScan [, ... 6 more fields], [, 
... 6 more fields]
  : +- InMemoryRelation [, ... 6 more fields], 
StorageLevel(memory, deserialized, 1 replicas)
  :   +- *(1) FileScan csv [,... 6 more 
fields] , ... 6 more fields
  +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
 +- Exchange(coordinator id: 1456511137) 
UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
67108864]
+- *(3) Project [, ... 74 more fields]
   +- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
<=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
  +- InMemoryTableScan [, ... 74 more fields], [, 
... 4 more fields]
+- InMemoryRelation [, ... 74 more fields], 
StorageLevel(memory, deserialized, 1 replicas)
  +- *(1) FileScan csv [,... 74 more 
fields] , ... 6 more fields

Compiling "GeneratedClass": Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB

and the generated code failed to compile.
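
As the WARN above shows, Spark already detects the failed compilation and falls back to non-codegen execution for that plan. A hedged sketch of how to skip the compilation attempt entirely for such queries (a general workaround, not the fix this ticket asks for):

{code:scala}
// Disable whole-stage codegen for the session (or just around the offending query),
// so Spark uses the interpreted/iterator execution path from the start.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}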



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-12 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-25093:
---

 Summary: CodeFormatter could avoid creating regex object again and 
again
 Key: SPARK-25093
 URL: https://issues.apache.org/jira/browse/SPARK-25093
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Izek Greenfield


In class `CodeFormatter`, the method `stripExtraNewLinesAndComments` could be refactored to:


{code:scala}
// Regexes are compiled once, when CodeFormatter is loaded, instead of on every call.
val commentReg =
  ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +   // strip /*comment*/
   """([ |\t]*?\/\/[\s\S]*?\n)""").r              // strip //comment
val emptyRowsReg = """\n\s*\n""".r

def stripExtraNewLinesAndComments(input: String): String = {
  val codeWithoutComment = commentReg.replaceAllIn(input, "")
  emptyRowsReg.replaceAllIn(codeWithoutComment, "\n")   // strip extra new lines
}
{code}

This way the regexes are compiled only once.
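
To make the behavior concrete, a small self-contained sketch (the input string is made up) showing what the refactored method strips:

{code:scala}
// Demo of stripExtraNewLinesAndComments with the regexes compiled once as fields.
object StripDemo {
  val commentReg =
    ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +   // strip /*comment*/
     """([ |\t]*?\/\/[\s\S]*?\n)""").r              // strip //comment
  val emptyRowsReg = """\n\s*\n""".r

  def stripExtraNewLinesAndComments(input: String): String = {
    val codeWithoutComment = commentReg.replaceAllIn(input, "")
    emptyRowsReg.replaceAllIn(codeWithoutComment, "\n")
  }

  def main(args: Array[String]): Unit = {
    val generated = "int a = 1; // counter\n\n\n/* temp */\nint b = 2;\n"
    // prints "int a = 1;" and "int b = 2;" on consecutive lines:
    // the comments are gone and the run of blank lines is collapsed.
    print(stripExtraNewLinesAndComments(generated))
  }
}
{code}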




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org