[jira] [Commented] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?
[ https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607019#comment-16607019 ] Marco Gaido commented on SPARK-25364: - It seems you created two JIRAs which are the same; if that is the case, can you close this one or the next? Thanks. > a better way to handle vector index and sparsity in FeatureHasher > implementation ? > -- > > Key: SPARK-25364 > URL: https://issues.apache.org/jira/browse/SPARK-25364 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 2.3.1 >Reporter: Vincent >Priority: Major > > In the current implementation of FeatureHasher.transform, a simple modulo on > the hashed value is used to determine the vector index; it is suggested to use > a large integer value as the numFeatures parameter. > We found several issues with the current implementation: > # The feature name cannot be recovered from its index after the FeatureHasher > transform, for example when getting feature importances from a decision tree > trained on FeatureHasher output. > # When indices collide, which is very likely to happen especially when > 'numFeatures' is relatively small, the value is updated with the sum of the > current and old values, i.e., the value at the colliding vector index is > changed by this modulo scheme. > # To avoid collisions, we would have to set 'numFeatures' to a large number, but a > highly sparse vector increases the computation complexity of model training. > We are working on fixing these problems for our business needs; since this > might or might not be an issue for others as well, we'd like to hear from the > community. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
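For readers unfamiliar with the hashing trick, the indexing scheme the reporter describes can be sketched in plain Java. This is illustrative only: Spark's FeatureHasher uses a Murmur3 hash rather than String.hashCode, and the feature names and class here are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class HashingTrickSketch {
    // Map a raw hash to a vector index in [0, numFeatures), as a modulo does.
    static int indexFor(String feature, int numFeatures) {
        int h = feature.hashCode();  // stand-in for Spark's Murmur3 hash
        return ((h % numFeatures) + numFeatures) % numFeatures;  // non-negative modulo
    }

    // Hash features into a sparse vector; colliding indices accumulate their values.
    static Map<Integer, Double> transform(Map<String, Double> features, int numFeatures) {
        Map<Integer, Double> vec = new HashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            int idx = indexFor(e.getKey(), numFeatures);
            vec.merge(idx, e.getValue(), Double::sum);  // sum on collision: originals are lost
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new HashMap<>();
        features.put("featureA", 1.0);
        features.put("featureB", 2.0);
        // With a tiny numFeatures, distinct features are likely to share an index,
        // and the index-to-feature-name mapping is not recoverable.
        System.out.println(transform(features, 2));
    }
}
```

This is why a small numFeatures makes collisions likely (issues 1 and 2 above), while a large numFeatures trades that for sparsity (issue 3).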
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604133#comment-16604133 ] Marco Gaido commented on SPARK-25317: - [~kiszk] sure, we can investigate the root cause further in the PR. Thanks. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating the hash code of a UTF8String:
> {code:java}
> test("hashing") {
>   import org.apache.spark.unsafe.hash.Murmur3_x86_32
>   import org.apache.spark.unsafe.types.UTF8String
>   val hasher = new Murmur3_x86_32(0)
>   val str = UTF8String.fromString("b" * 10001)
>   val numIter = 10
>   val start = System.nanoTime
>   for (i <- 0 until numIter) {
>     // the call is unrolled 30 times per iteration
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>     Murmur3_x86_32.hashUTF8String(str, 0)
>   }
>   val duration = (System.nanoTime() - start) / 1000 / numIter
>   println(s"duration $duration us")
> }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
>   return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed);
> }
> {code}
> to `Murmur3_x86_32`.
> On my laptop, the result for master vs 2.3 is: 120 us vs 40 us
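For reference, the hash being benchmarked is in the MurmurHash3 x86_32 family. Below is a self-contained Java sketch of the standard algorithm over a byte array with a timing loop mirroring the test above; this is the reference algorithm, not Spark's MemoryBlock-based implementation (Spark's variant also handles the tail bytes slightly differently), so absolute numbers are not comparable to the 120 us vs 40 us figures.

```java
public class Murmur3Sketch {
    // Standard MurmurHash3 x86_32 over a byte array.
    static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h = seed;
        int nblocks = data.length / 4;
        for (int i = 0; i < nblocks; i++) {  // hot loop: one 4-byte word per step
            int k = (data[4 * i] & 0xff)
                  | (data[4 * i + 1] & 0xff) << 8
                  | (data[4 * i + 2] & 0xff) << 16
                  | (data[4 * i + 3] & 0xff) << 24;
            k *= c1; k = Integer.rotateLeft(k, 15); k *= c2;
            h ^= k; h = Integer.rotateLeft(h, 13); h = h * 5 + 0xe6546b64;
        }
        int k = 0;  // tail: remaining 1-3 bytes (fall-through is intentional)
        switch (data.length & 3) {
            case 3: k ^= (data[4 * nblocks + 2] & 0xff) << 16;
            case 2: k ^= (data[4 * nblocks + 1] & 0xff) << 8;
            case 1: k ^= (data[4 * nblocks] & 0xff);
                    k *= c1; k = Integer.rotateLeft(k, 15); k *= c2; h ^= k;
        }
        h ^= data.length;  // finalization (avalanche)
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        byte[] str = new byte[10001];
        java.util.Arrays.fill(str, (byte) 'b');
        int numIter = 10, callsPerIter = 30;  // mirrors the 30 unrolled calls above
        long start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int c = 0; c < callsPerIter; c++) murmur3_32(str, 0);
        }
        long durationUs = (System.nanoTime() - start) / 1000 / numIter;
        System.out.println("duration " + durationUs + " us");
    }
}
```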
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603174#comment-16603174 ] Marco Gaido commented on SPARK-25317: - I think I have a fix for this. I can submit a PR if you want, but I am still not sure about the root cause of the regression. My best guess is that there is more than one cause and the perf improvement happens only if all of them are fixed, which is rather strange to me. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker
[jira] [Created] (LIVY-506) Dedicated thread for timeout checker
Marco Gaido created LIVY-506: Summary: Dedicated thread for timeout checker Key: LIVY-506 URL: https://issues.apache.org/jira/browse/LIVY-506 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido The timeout checker task currently runs on the background thread pool. Since the task stays alive forever, it permanently occupies one of the pool's threads, which doesn't make much sense; it should instead run on its own dedicated thread.
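The change described above can be sketched as follows: a dedicated daemon thread for the periodic check, instead of parking a never-terminating task on a shared pool. The class and the body of the check are illustrative, not Livy's actual code.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TimeoutCheckerSketch {
    private final AtomicInteger checks = new AtomicInteger();
    private volatile boolean running = true;

    // A dedicated daemon thread for the periodic check, rather than submitting a
    // never-terminating task to a shared pool (which pins one worker forever).
    private final Thread checker = new Thread(() -> {
        while (running) {
            checks.incrementAndGet();  // stand-in for "expire timed-out sessions"
            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
        }
    }, "timeout-checker");

    void start() {
        checker.setDaemon(true);  // must not keep the JVM alive on shutdown
        checker.start();
    }

    void stop() throws InterruptedException {
        running = false;
        checker.interrupt();
        checker.join();
    }

    int checksRun() { return checks.get(); }

    public static void main(String[] args) throws InterruptedException {
        TimeoutCheckerSketch s = new TimeoutCheckerSketch();
        s.start();
        Thread.sleep(100);
        s.stop();
        System.out.println("checks run: " + s.checksRun());
    }
}
```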
[jira] [Commented] (SPARK-25265) Fix memory leak vulnerability in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596221#comment-16596221 ] Marco Gaido commented on SPARK-25265: - Isn't this a duplicate of the next one? > Fix memory leak vulnerability in Barrier Execution Mode > --- > > Key: SPARK-25265 > URL: https://issues.apache.org/jira/browse/SPARK-25265 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Critical > > BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked > in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. > Once a TimerTask is scheduled, the reference to it is not released until > `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
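The leak described above can be demonstrated with plain `java.util.Timer`; this is a minimal standalone sketch, not the BarrierCoordinator code itself.

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerPurgeSketch {
    public static void main(String[] args) {
        Timer timer = new Timer("barrier-timer", true);
        TimerTask task = new TimerTask() {
            @Override public void run() { /* would time out a barrier() call */ }
        };
        timer.schedule(task, 60_000);  // scheduled far in the future

        // cancel() only marks the task as cancelled; the Timer's internal queue
        // still holds a reference to it...
        task.cancel();

        // ...until purge() sweeps cancelled tasks out of the queue, releasing the
        // references. Without this call, cancelled tasks accumulate.
        int removed = timer.purge();
        System.out.println("purged tasks: " + removed);
        timer.cancel();
    }
}
```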
[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect
[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596109#comment-16596109 ] Marco Gaido commented on SPARK-25219: - Well, there are many differences between the Spark ML and SKLearn code you've posted. First of all, the number of clusters is different. Moreover, the input data to KMeans can be different. Please store the data after the TF-IDF transformation, which is the interesting one. Then, take the KMeans results and the centroids, and check whether the distance of each point to the centroid it has been assigned to is lower than its distance to all the other centroids. If that is the case, there is no issue with KMeans. You may have to increase the number of runs, change the initialization method, change the seed and so on to get a different result, but there is no evident bug in the algorithm itself. If that is not the case, then with the input data to KMeans and the reproducer, I can investigate the problem. Thanks. > KMeans Clustering - Text Data - Results are incorrect > - > > Key: SPARK-25219 > URL: https://issues.apache.org/jira/browse/SPARK-25219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Vasanthkumar Velayudham >Priority: Major > Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, Spark_Kmeans.txt > > Hello Everyone, > I am facing issues with the usage of KMeans Clustering on my text data. When > I apply clustering on my text data, after performing various transformations > such as RegexTokenizer, Stopword Processing, HashingTF and IDF, the generated > clusters are not proper and one cluster is found to have a lot of data points > assigned to it. > I am able to perform clustering with similar processing and the same > attributes with the SKLearn KMeans algorithm. > Upon searching the internet, I observe many have reported the same issue with > the KMeans clustering library of Spark. > Request your help in fixing this issue. > Please let me know if you require any additional details.
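The nearest-centroid check suggested in the comment above can be sketched in plain Java (the data here is hypothetical; in practice the points, centroids and assignments would come from the stored TF-IDF output and the KMeans model):

```java
public class AssignmentCheckSketch {
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // The diagnostic: every point's assigned centroid must be at least as close as
    // every other centroid; otherwise the clustering itself is broken.
    static boolean assignmentsAreNearest(double[][] points, double[][] centroids, int[] assigned) {
        for (int p = 0; p < points.length; p++) {
            double own = sqDist(points[p], centroids[assigned[p]]);
            for (double[] c : centroids) {
                if (sqDist(points[p], c) < own) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] points    = { {0, 0}, {0.4, 0}, {10, 10}, {10.2, 9.9} };
        double[][] centroids = { {0.2, 0}, {10.1, 9.95} };
        int[] assigned       = { 0, 0, 1, 1 };
        System.out.println(assignmentsAreNearest(points, centroids, assigned));
    }
}
```

If this check passes, an unbalanced clustering reflects the data, initialization or seed, not a bug in the algorithm.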
[jira] [Commented] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595068#comment-16595068 ] Marco Gaido commented on SPARK-23622: - This failure became permanent in the last build (or at least it seems so). > Flaky Test: HiveClientSuites > > > Key: SPARK-23622 > URL: https://issues.apache.org/jira/browse/SPARK-23622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325
> {code}
> Error Message
> java.lang.reflect.InvocationTargetException: null
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
>   at org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
>   at org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
>   at org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
>   at org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
>   at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
>   at org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
>   at org.scalatest.Suite$class.run(Suite.scala:1144)
>   at org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
>   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
>   ... 29 more
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:63)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683)
>   at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425)
>   ... 31 more
> Caused by: sbt.ForkMain$ForkError:
[jira] [Created] (LIVY-503) Move RPC classes used in thriftserver to a separate module
Marco Gaido created LIVY-503: Summary: Move RPC classes used in thriftserver to a separate module Key: LIVY-503 URL: https://issues.apache.org/jira/browse/LIVY-503 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido As suggested in the discussion on the original PR (https://github.com/apache/incubator-livy/pull/104#discussion_r212806490), we should move the RPC classes which need to be uploaded to the Spark session into a separate module, in order to upload as few classes as possible and avoid any potential interference with the created Spark session.
[jira] [Created] (LIVY-502) Cleanup Hive dependencies
Marco Gaido created LIVY-502: Summary: Cleanup Hive dependencies Key: LIVY-502 URL: https://issues.apache.org/jira/browse/LIVY-502 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido In the initial implementation we rely on and delegate some of the work to the Hive classes used in HiveServer2. This helped simplify the first implementation, as it saved writing a lot of code, but it also introduced a dependency on the {{hive-exec}} package and compelled us to modify some of the existing Hive classes. This JIRA tracks removing these workarounds by re-implementing the same logic in Livy, getting rid of all Hive dependencies other than the rpc and service layers.
[jira] [Commented] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592609#comment-16592609 ] Marco Gaido commented on SPARK-25193: - I think this is HIVE-12505, so it would need to be fixed in the Hive version shipped with Spark... > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode drops the old data in the Hive table if there is any. > But if deleting the data fails, no exception is thrown and the data folder > ends up like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of the data are kept.
[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect
[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591423#comment-16591423 ] Marco Gaido commented on SPARK-25219: - Hi [~VVasanth], a JIRA like this is very difficult to work on: saying that something returns a result which is not the expected one is not a great starting point for taking action. It would be great if you could provide a simple reproducer. The reproducer should involve only one thing if possible (in this case KMeans, without the other transformations), with a set of parameters that reproduces the problem and the expected result which is returned with the same parameters by the other libraries. If the problem is clearer, I am happy to work on it, but first we need to understand whether this is indeed an issue and how to reproduce it. Thanks. > KMeans Clustering - Text Data - Results are incorrect > - > > Key: SPARK-25219 > URL: https://issues.apache.org/jira/browse/SPARK-25219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Vasanthkumar Velayudham >Priority: Major
[jira] [Updated] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect
[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25219: Component/s: (was: Spark Submit) ML > KMeans Clustering - Text Data - Results are incorrect > - > > Key: SPARK-25219 > URL: https://issues.apache.org/jira/browse/SPARK-25219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Vasanthkumar Velayudham >Priority: Major
[jira] [Commented] (SPARK-25146) avg() returns null on some decimals
[ https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584016#comment-16584016 ] Marco Gaido commented on SPARK-25146: - No problem, thanks for reporting this anyway. > avg() returns null on some decimals > --- > > Key: SPARK-25146 > URL: https://issues.apache.org/jira/browse/SPARK-25146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Daniel Darabos >Priority: Major > > We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average > them. The average in some cases comes out to {{null}} to our surprise (and > disappointment). > After a bit of digging it looks like these numbers have ended up with the > {{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with > this type: > {code} > scala> (1 to 1).map(_*0.001).toDF.createOrReplaceTempView("x") > scala> spark.sql("select cast(value as decimal(37, 30)) as v from > x").createOrReplaceTempView("x") > scala> spark.sql("select avg(v) from x").show > +------+ > |avg(v)| > +------+ > |  null| > +------+ > {code} > For up to 4471 numbers it is able to calculate the average. For 4472 or more > numbers it's {{null}}. > Now I'll just change these numbers to {{double}}. But we got the types > entirely automatically. We never asked for {{decimal}}. If this is the > default type, it's important to support averaging a handful of them. (Sorry > for the bitterness. I like {{double}} more. :)) > Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise > that {{avg()}} fails.
[jira] [Commented] (SPARK-25146) avg() returns null on some decimals
[ https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583936#comment-16583936 ] Marco Gaido commented on SPARK-25146: - This has been fixed by SPARK-24957; it doesn't repro on current master. I can only advise upgrading to 2.3.2 or 2.4.0 once they are available (which should not be too far away). I am closing this as a duplicate. Please reopen if anything else is needed. Thanks. > avg() returns null on some decimals > --- > > Key: SPARK-25146 > URL: https://issues.apache.org/jira/browse/SPARK-25146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Daniel Darabos >Priority: Major
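For background on why decimal aggregates can turn into null: a decimal(p, s) value has only p - s digits available before the decimal point, so decimal(38,30) (the widest Spark type at that scale) leaves just 8 integer digits, and a value that does not fit its target precision surfaces as null. The sketch below checks the "fits" condition with plain BigDecimal; it is an illustration of the precision constraint, not Spark's Decimal class or the exact root cause fixed by SPARK-24957.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalOverflowSketch {
    // Does this value fit in decimal(precision, scale)?
    // Spark yields null when an intermediate result does not fit.
    static boolean fits(BigDecimal v, int precision, int scale) {
        BigDecimal rescaled = v.setScale(scale, RoundingMode.HALF_UP);
        return rescaled.precision() <= precision;
    }

    public static void main(String[] args) {
        // decimal(38,30) leaves 8 digits for the integer part:
        System.out.println(fits(new BigDecimal("12345678.9"), 38, 30));  // 8 integer digits: fits
        System.out.println(fits(new BigDecimal("123456789"), 38, 30));   // 9 integer digits: overflow
    }
}
```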
[jira] [Resolved] (SPARK-25146) avg() returns null on some decimals
[ https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-25146. - Resolution: Duplicate > avg() returns null on some decimals > --- > > Key: SPARK-25146 > URL: https://issues.apache.org/jira/browse/SPARK-25146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Daniel Darabos >Priority: Major
[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True
[ https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583928#comment-16583928 ] Marco Gaido commented on SPARK-25145: - cc [~dongjoon] > Buffer size too small on spark.sql query with filterPushdown predicate=True > --- > > Key: SPARK-25145 > URL: https://issues.apache.org/jira/browse/SPARK-25145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 > Environment: > {noformat} > # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018 > spark.driver.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.eventLog.dir hdfs:///spark2-history/ > spark.eventLog.enabled true > spark.executor.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.hadoop.hive.vectorized.execution.enabled true > spark.history.fs.logDirectory hdfs:///spark2-history/ > spark.history.kerberos.keytab none > spark.history.kerberos.principal none > spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider > spark.history.retainedApplications 50 > spark.history.ui.port 18081 > spark.io.compression.lz4.blockSize 128k > spark.locality.wait 2s > spark.network.timeout 600s > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.shuffle.consolidateFiles true > spark.shuffle.io.numConnectionsPerPeer 10 > spark.sql.autoBroadcastJoinTreshold 26214400 > spark.sql.shuffle.partitions 300 > spark.sql.statistics.fallBack.toHdfs true > spark.sql.tungsten.enabled true > spark.driver.memoryOverhead 2048 > spark.executor.memoryOverhead 4096 > spark.yarn.historyServer.address service-10-4.local:18081 > spark.yarn.queue default > spark.sql.warehouse.dir hdfs:///apps/hive/warehouse > spark.sql.execution.arrow.enabled true > spark.sql.hive.convertMetastoreOrc true > spark.sql.orc.char.enabled true > spark.sql.orc.enabled true > spark.sql.orc.filterPushdown true 
> spark.sql.orc.impl native > spark.sql.orc.enableVectorizedReader true > spark.yarn.jars hdfs:///apps/spark-jars/231/jars/* > {noformat} > >Reporter: Bjørnar Jensen >Priority: Minor > Attachments: create_bug.py, report.txt > > > java.lang.IllegalArgumentException: Buffer size too small. size = 262144 > needed = 2205991 > # > {code:java} > Python > import numpy as np > import pandas as pd > # Create a spark dataframe > df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0}) > sdf = spark.createDataFrame(df) > print('Created spark dataframe:') > sdf.show() > # Save table as orc > sdf.write.saveAsTable(format='orc', mode='overwrite', > name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', > compression='zlib') > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Fetch entire table (works) > print('Read entire table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown').show() > # Ensure filterPushdown is disabled > spark.conf.set('spark.sql.orc.filterPushdown', False) > # Query without filterPushdown (works) > print('Read a selection from table with "filterPushdown"=False') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Query with filterPushDown (fails) > print('Read a selection from table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > {code} > {noformat} > ~/bug_report $ pyspark > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port > 4040. Attempting port 4041. 
> Jupyter console 5.1.0 > Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28) > Type 'copyright', 'credits' or 'license' for more information > IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help. > In [1]: %run -i create_bug.py > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT > /_/ > Using Python version 3.6.3 (default, May 4 2018 04:22:28) > SparkSession available as 'spark'. > Created spark dataframe: > +---+---+ > | a| b| > +---+---+ > | 0|0.0| > | 1|0.5| > | 2|1.0| > | 3|1.5| > | 4|2.0| > | 5|2.5| > | 6|3.0| > | 7|3.5| > | 8|4.0| > | 9|4.5| > +---+---+ > Read entire table with "filterPushdown"=True > +---+---+ > | a| b| > +---+---+ > | 1|0.5| > | 2|1.0| > | 3|1.5| > | 5|2.5| > | 6|3.0| > | 7|3.5| > | 8|4.0| > | 9|4.5| > | 4|2.0| > |
[jira] [Commented] (SPARK-25138) Spark Shell should show the Scala prompt after initialization is complete
[ https://issues.apache.org/jira/browse/SPARK-25138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583585#comment-16583585 ] Marco Gaido commented on SPARK-25138: - [~smilegator] this is caused by SPARK-24418 and it is a duplicate of SPARK-24785, for which there is a PR. cc [~dbtsai] I am closing this as a duplicate. Please reopen if needed. Thanks. > Spark Shell should show the Scala prompt after initialization is complete > - > > Key: SPARK-25138 > URL: https://issues.apache.org/jira/browse/SPARK-25138 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > In previous Spark versions, the Spark Shell used to only show the Scala > prompt *after* Spark has initialized. i.e. when the user is able to enter > code, the Spark context, Spark session etc have all completed initialization, > so {{sc}}, {{spark}} are all ready to use. > In the current Spark master branch (to become Spark 2.4.0), the Scala prompt > shows up immediately, while Spark itself is still in initialization in the > background. It's very easy for the user to feel as if the shell is ready and > start typing, only to find that Spark isn't ready yet, and Spark's > initialization logs get in the way of typing. This new behavior is rather > annoying from a usability's perspective. > A typical startup of the Spark Shell in current master: > {code:none} > $ bin/spark-shell > 18/08/16 23:18:05 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). 
> Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT > /_/ > > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.range(1)Spark context Web UI available at http://localhost:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1534486692744). > Spark session available as 'spark'. > .show > +---+ > | id| > +---+ > | 0| > +---+ > scala> > {code} > Could you see that it was running {{spark.range(1).show}} ? > In contrast, previous versions of Spark Shell would wait for Spark to fully > initialization: > {code:none} > $ bin/spark-shell > 18/08/16 23:20:05 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://10.0.0.76:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1534486813159). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.range(1).show > +---+ > | id| > +---+ > | 0| > +---+ > scala> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25138) Spark Shell should show the Scala prompt after initialization is complete
[ https://issues.apache.org/jira/browse/SPARK-25138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-25138. - Resolution: Duplicate > Spark Shell should show the Scala prompt after initialization is complete > - > > Key: SPARK-25138 > URL: https://issues.apache.org/jira/browse/SPARK-25138 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > In previous Spark versions, the Spark Shell used to only show the Scala > prompt *after* Spark has initialized. i.e. when the user is able to enter > code, the Spark context, Spark session etc have all completed initialization, > so {{sc}}, {{spark}} are all ready to use. > In the current Spark master branch (to become Spark 2.4.0), the Scala prompt > shows up immediately, while Spark itself is still in initialization in the > background. It's very easy for the user to feel as if the shell is ready and > start typing, only to find that Spark isn't ready yet, and Spark's > initialization logs get in the way of typing. This new behavior is rather > annoying from a usability's perspective. > A typical startup of the Spark Shell in current master: > {code:none} > $ bin/spark-shell > 18/08/16 23:18:05 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT > /_/ > > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.range(1)Spark context Web UI available at http://localhost:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1534486692744). > Spark session available as 'spark'. > .show > +---+ > | id| > +---+ > | 0| > +---+ > scala> > {code} > Could you see that it was running {{spark.range(1).show}} ? > In contrast, previous versions of Spark Shell would wait for Spark to fully > initialization: > {code:none} > $ bin/spark-shell > 18/08/16 23:20:05 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://10.0.0.76:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1534486813159). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.range(1).show > +---+ > | id| > +---+ > | 0| > +---+ > scala> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25093) CodeFormatter could avoid creating regex object again and again
[ https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582543#comment-16582543 ] Marco Gaido commented on SPARK-25093: - [~igreenfi] do you want to submit a PR for this? Otherwise I can do it. Thanks. > CodeFormatter could avoid creating regex object again and again > --- > > Key: SPARK-25093 > URL: https://issues.apache.org/jira/browse/SPARK-25093 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Minor > > in class `CodeFormatter` > method: `stripExtraNewLinesAndComments` > could be refactored to: > {code:scala} > // Some comments here > val commentReg = > ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +// strip /*comment*/ > """([ |\t]*?\/\/[\s\S]*?\n)""").r // strip //comment > val emptyRowsReg = """\n\s*\n""".r > def stripExtraNewLinesAndComments(input: String): String = { > val codeWithoutComment = commentReg.replaceAllIn(input, "") > emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip ExtraNewLines > } > {code} > so the Regex would be compiled only once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
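The gain in the proposed refactoring above is simply that the `Regex` objects become `val`s compiled once at class-load time instead of being rebuilt on every call to `stripExtraNewLinesAndComments`. The same precompile-once idea can be sketched in plain Python; the function and pattern names here are mine, and the patterns are simplified relative to the Scala ones:

```python
import re

# Compiled once at module load, mirroring the proposed Scala `val`s.
COMMENT_RE = re.compile(
    r"([ \t]*/\*[\s\S]*?\*/[ \t]*)"   # strip /* block comments */
    r"|([ \t]*//[^\n]*\n)"            # strip // line comments
)
EMPTY_ROWS_RE = re.compile(r"\n\s*\n")

def strip_extra_newlines_and_comments(source: str) -> str:
    """Remove comments, then collapse runs of blank lines to one newline."""
    without_comments = COMMENT_RE.sub("", source)
    return EMPTY_ROWS_RE.sub("\n", without_comments)
```

Hoisting the compiled patterns to module level means each call pays only for the matching, not for parsing the regex again, which is exactly what moving the Scala `Regex` objects out of the method achieves.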
[jira] [Commented] (SPARK-25031) The schema of MapType can not be printed correctly
[ https://issues.apache.org/jira/browse/SPARK-25031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582531#comment-16582531 ] Marco Gaido commented on SPARK-25031: - ^ kindly ping [~smilegator] > The schema of MapType can not be printed correctly > -- > > Key: SPARK-25031 > URL: https://issues.apache.org/jira/browse/SPARK-25031 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Hao Ren >Priority: Minor > Labels: easyfix > > Something wrong with the function `buildFormattedString` in `MapType` > > {code:java} > import spark.implicits._ > case class Key(a: Int) > case class Value(b: Int) > Seq( > (1, Map(Key(1) -> Value(2))), > (2, Map(Key(1) -> Value(2))) > ).toDF("id", "dict").printSchema > {code} > The result is: > {code:java} > root > |-- id: integer (nullable = false) > |-- dict: map (nullable = true) > | |-- key: struct > | |-- value: struct (valueContainsNull = true) > | | |-- a: integer (nullable = false) > | | |-- b: integer (nullable = false) > {code} > The expected is > {code:java} > root > |-- id: integer (nullable = false) > |-- dict: map (nullable = true) > | |-- key: struct > | | |-- a: integer (nullable = false) > | |-- value: struct (valueContainsNull = true) > | | |-- b: integer (nullable = false) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
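The bug report above comes down to the recursion in `buildFormattedString`: the key type's children must be printed one level under the `key` line, before the `value` line is emitted, not merged after it. A self-contained Python sketch of the corrected recursion (the tuple-based type encoding here is invented for illustration and is not Spark's `DataType` API):

```python
# Minimal type model: ("int",) is a leaf, ("struct", [(name, child), ...])
# is a struct, ("map", key_type, value_type) is a map.

def fmt(dtype, prefix, out):
    """Append schema lines for dtype to out, one level deeper per recursion."""
    kind = dtype[0]
    if kind == "struct":
        for name, child in dtype[1]:
            out.append(f"{prefix}|-- {name}: {child[0]}")
            fmt(child, prefix + "|    ", out)
    elif kind == "map":
        # The fix: print the key line, then the key's children directly
        # under it, and only then the value line and its children.
        out.append(f"{prefix}|-- key: {dtype[1][0]}")
        fmt(dtype[1], prefix + "|    ", out)
        out.append(f"{prefix}|-- value: {dtype[2][0]}")
        fmt(dtype[2], prefix + "|    ", out)
```

Run against a struct mirroring the `(id, dict)` example above, this produces the "expected" tree from the report, with `a` nested under `key` and `b` under `value`.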
[jira] [Comment Edited] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets
[ https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582383#comment-16582383 ] Marco Gaido edited comment on SPARK-25125 at 8/16/18 1:07 PM: -- I think this may be a duplicate of SPARK-24013. [~myali] may you please try and check whether current master still has the issue? was (Author: mgaido): I think his may be a duplicate of SPARK-25125. [~myali] may you please try and check whether current master still have the issue? If > Spark SQL percentile_approx takes longer than Hive version for large datasets > - > > Key: SPARK-25125 > URL: https://issues.apache.org/jira/browse/SPARK-25125 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Mir Ali >Priority: Major > > The percentile_approx function in Spark SQL takes much longer than the > previous Hive implementation for large data sets (7B rows grouped into 200k > buckets, percentile is on each bucket). Tested with Spark 2.3.1 vs Spark > 2.1.0. > The below code finishes in around 24 minutes on spark 2.1.0, on spark 2.3.1, > this does not finish at all in more than 2 hours. 
Also tried this with > different accuracy values 5000,1000,500, the timing does get better with > smaller datasets with the new version, but the speed difference is > insignificant > > Infrastructure used: > AWS EMR -> Spark 2.1.0 > vs > AWS EMR -> Spark 2.3.1 > > spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g > --conf spark.sql.shuffle.partitions=2000 --conf > spark.default.parallelism=2000 --num-executors=75 --executor-cores=2 > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df=spark.range(70L).withColumn("some_grouping_id", > round(rand()*20L).cast(LongType)) > df.createOrReplaceTempView("tab") > val percentile_query = """ select some_grouping_id, percentile_approx(id, > array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ > spark.sql(percentile_query).collect() > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets
[ https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582383#comment-16582383 ] Marco Gaido edited comment on SPARK-25125 at 8/16/18 1:07 PM: -- I think this may be a duplicate of SPARK-24013. [~myali] may you please try and check whether current master still have the issue? was (Author: mgaido): I think his may be a duplicate of SPARK-24013. [~myali] may you please try and check whether current master still have the issue? If > Spark SQL percentile_approx takes longer than Hive version for large datasets > - > > Key: SPARK-25125 > URL: https://issues.apache.org/jira/browse/SPARK-25125 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Mir Ali >Priority: Major > > The percentile_approx function in Spark SQL takes much longer than the > previous Hive implementation for large data sets (7B rows grouped into 200k > buckets, percentile is on each bucket). Tested with Spark 2.3.1 vs Spark > 2.1.0. > The below code finishes in around 24 minutes on spark 2.1.0, on spark 2.3.1, > this does not finish at all in more than 2 hours. 
Also tried this with > different accuracy values 5000,1000,500, the timing does get better with > smaller datasets with the new version, but the speed difference is > insignificant > > Infrastructure used: > AWS EMR -> Spark 2.1.0 > vs > AWS EMR -> Spark 2.3.1 > > spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g > --conf spark.sql.shuffle.partitions=2000 --conf > spark.default.parallelism=2000 --num-executors=75 --executor-cores=2 > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df=spark.range(70L).withColumn("some_grouping_id", > round(rand()*20L).cast(LongType)) > df.createOrReplaceTempView("tab") > val percentile_query = """ select some_grouping_id, percentile_approx(id, > array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ > spark.sql(percentile_query).collect() > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets
[ https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582383#comment-16582383 ] Marco Gaido commented on SPARK-25125: - I think this may be a duplicate of SPARK-24013. [~myali] may you please try and check whether current master still has the issue? > Spark SQL percentile_approx takes longer than Hive version for large datasets > - > > Key: SPARK-25125 > URL: https://issues.apache.org/jira/browse/SPARK-25125 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Mir Ali >Priority: Major > > The percentile_approx function in Spark SQL takes much longer than the > previous Hive implementation for large data sets (7B rows grouped into 200k > buckets, percentile is on each bucket). Tested with Spark 2.3.1 vs Spark > 2.1.0. > The below code finishes in around 24 minutes on spark 2.1.0, on spark 2.3.1, > this does not finish at all in more than 2 hours. Also tried this with > different accuracy values 5000,1000,500, the timing does get better with > smaller datasets with the new version, but the speed difference is > insignificant > > Infrastructure used: > AWS EMR -> Spark 2.1.0 > vs > AWS EMR -> Spark 2.3.1 > > spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g > --conf spark.sql.shuffle.partitions=2000 --conf > spark.default.parallelism=2000 --num-executors=75 --executor-cores=2 > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df=spark.range(70L).withColumn("some_grouping_id", > round(rand()*20L).cast(LongType)) > df.createOrReplaceTempView("tab") > val percentile_query = """ select some_grouping_id, percentile_approx(id, > array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ > spark.sql(percentile_query).collect() > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581364#comment-16581364 ] Marco Gaido commented on SPARK-23908: - [~huaxingao] they are not exposed through the Scala API, so they are not exposed through the other APIs either. Thanks. > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
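The Presto semantics quoted above are easy to model. As a rough illustration only, in plain Python with `None` standing in for SQL NULL (this is neither Spark's nor Presto's implementation):

```python
def transform(arr, fn):
    """Apply fn to each element of arr; a NULL (None) array yields NULL."""
    if arr is None:
        return None
    return [fn(x) for x in arr]

def coalesce(*args):
    """First non-NULL argument, mirroring SQL COALESCE."""
    return next((a for a in args if a is not None), None)
```

For example, `transform([5, None, 6], lambda x: coalesce(x, 0) + 1)` reproduces the `[6, 1, 7]` result from the third Presto query above.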
[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy
[ https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581075#comment-16581075 ] Marco Gaido commented on LIVY-489: -- Sure [~jerryshao], thank you. I am submitting the first PR for 2, 3, 4. Thanks. > Expose a JDBC endpoint for Livy > --- > > Key: LIVY-489 > URL: https://issues.apache.org/jira/browse/LIVY-489 > Project: Livy > Issue Type: New Feature > Components: API, Server >Affects Versions: 0.6.0 >Reporter: Marco Gaido >Priority: Major > > Many users and BI tools use JDBC connections in order to retrieve data. As > Livy exposes only a REST API, this is a limitation in its adoption. Hence, > adding a JDBC endpoint may be a very useful feature, which could also make > Livy a more attractive solution for end users to adopt. > Moreover, currently, Spark exposes a JDBC interface, but this has many > limitations, including that all the queries are submitted to the same > application, therefore there is no isolation/security, which can be offered > by Livy, making a Livy JDBC API a better solution for companies/users who > want to use Spark in order to run their queries through JDBC. > In order to make the transition from existing solutions to the new JDBC > server seamless, the proposal is to use the Hive thrift-server and extend it > as it was done by the STS. > [Here, you can find the design > doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SPARK-25123) SimpleExprValue may cause the loss of a reference
Marco Gaido created SPARK-25123: --- Summary: SimpleExprValue may cause the loss of a reference Key: SPARK-25123 URL: https://issues.apache.org/jira/browse/SPARK-25123 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Marco Gaido While introducing the new JavaCode abstraction in order to enable tracking references and allowing transformations, we added 3 types of expression values. They are global variables, local variables and simple expressions. While checking whether we could use this new abstraction for fixing an issue reported in another JIRA, I just realized that SimpleExprValue contains a string with the generated code, but this can actually contain other variables. Since the value carried in SimpleExprValue is a string, though, we were losing track of the variable reference. So this JIRA is for using a Block in order to represent the java code carried by SimpleExprValue, so that we don't lose references. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
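The problem described above can be modelled in a few lines: once generated code is held as a bare string, the expression values it mentions are no longer reachable from it, while a Block-like container keeps both the text and the references. This is an illustrative sketch with invented names, not Spark's actual `JavaCode`/codegen API:

```python
class ExprValue:
    """A variable in the generated code, e.g. a global or local."""
    def __init__(self, name):
        self.name = name

class SimpleExprString:
    """Old behaviour: code is a plain string, so references are lost."""
    def __init__(self, code):
        self.code = code
        self.exprs = []  # nothing survives of the variables the code uses

class Block:
    """Sketch of the fix: keep the code parts AND the values they use."""
    def __init__(self, parts):
        self.parts = parts  # mix of literal strings and ExprValue objects

    @property
    def code(self):
        return "".join(p.name if isinstance(p, ExprValue) else p for p in self.parts)

    @property
    def exprs(self):
        return [p for p in self.parts if isinstance(p, ExprValue)]
```

With a `Block`, transformations can still enumerate and rewrite the referenced variables, which is exactly what a flat string makes impossible.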
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579514#comment-16579514 ] Marco Gaido commented on SPARK-25051: - This was caused by the introduction of AnalysisBarrier. I will submit a PR for branch 2.3. On 2.4+ (current master) we no longer have this issue because AnalysisBarrier was removed. Anyway, this raises a question: shall we remove AnalysisBarrier from the 2.3 line too? In the current situation, backporting any analyzer fix to 2.3 is going to be painful. cc [~rxin] [~cloud_fan] > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579444#comment-16579444 ] Marco Gaido commented on SPARK-25051: - cc [~jerryshao] shall we set it as a blocker for 2.3.2? > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25051: Labels: correctness (was: ) > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > Labels: correctness > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at 
org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578447#comment-16578447 ] Marco Gaido commented on SPARK-24928: - Actually this is a duplicate of SPARK-11982, which solved the issue for the SQL API. For the RDD API, please be careful choosing the right side of the cartesian. I am closing this as a duplicate. Feel free to reopen if you think anything else can be done. > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
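Why the operand order matters can be sketched with a toy nested-loop cross product: the inner side is re-scanned once per element of the outer side, so the side that is expensive to re-read (e.g. re-fetched from storage) should not be the inner one. This is a simplified model of the behaviour, not Spark's actual `CartesianRDD` code:

```python
def cartesian(outer, make_inner_scan):
    """Nested-loop cross product. make_inner_scan() is invoked once per
    outer element, modelling the inner side being re-read each time."""
    scans = 0
    pairs = []
    for a in outer:
        scans += 1  # one fresh scan of the inner side per outer element
        for b in make_inner_scan():
            pairs.append((a, b))
    return pairs, scans
```

Both orders produce the same number of result pairs, but the number of inner re-scans differs, which matches the report that swapping the arguments changed the runtime from minutes to seconds.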
[jira] [Resolved] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-24928. - Resolution: Duplicate > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578348#comment-16578348 ] Marco Gaido commented on SPARK-25094: - [~igreenfi] as I mentioned to you, this is a known issue. You found a TODO because currently it is not possible to implement that TODO. There is an ongoing effort to make it happen, but it is a huge effort, so it will take time. Thanks. > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > Attachments: generated_code.txt > > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen > disabled for plan (id=1): > *(1) Project [, ... 10 more fields] > +- *(1) Filter NOT exposure_calc_method#10141 IN > (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) >+- InMemoryTableScan [, ... 11 more fields], [NOT > exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] > +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, > deserialized, 1 replicas) >+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner > :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 > : +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > : +- *(1) Project [, ... 6 more fields] > :+- *(1) Filter (isnotnull(v#49) && > isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) > && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) > : +- InMemoryTableScan [, ... 6 more fields], [, > ... 6 more fields] > : +- InMemoryRelation [, ... 6 more > fields], StorageLevel(memory, deserialized, 1 replicas) > : +- *(1) FileScan csv [,... 6 more > fields] , ...
6 more fields > +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > +- *(3) Project [, ... 74 more fields] >+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 > <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) > +- InMemoryTableScan [, ... 74 more fields], [, > ... 4 more fields] > +- InMemoryRelation [, ... 74 more > fields], StorageLevel(memory, deserialized, 1 replicas) > +- *(1) FileScan csv [,... 74 more > fields] , ... 6 more fields > Compiling "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy
[ https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578237#comment-16578237 ] Marco Gaido commented on LIVY-489: -- [~jerryshao] I created 5 subtasks for this. Hope they are reasonable to you. Thanks. > Expose a JDBC endpoint for Livy > --- > > Key: LIVY-489 > URL: https://issues.apache.org/jira/browse/LIVY-489 > Project: Livy > Issue Type: New Feature > Components: API, Server >Affects Versions: 0.6.0 >Reporter: Marco Gaido >Priority: Major > > Many users and BI tools use JDBC connections in order to retrieve data. As > Livy exposes only a REST API, this is a limitation in its adoption. Hence, > adding a JDBC endpoint may be a very useful feature, which could also make > Livy a more attractive solution for end user to adopt. > Moreover, currently, Spark exposes a JDBC interface, but this has many > limitations, including that all the queries are submitted to the same > application, therefore there is no isolation/security, which can be offered > by Livy, making a Livy JDBC API a better solution for companies/users who > want to use Spark in order to run they queries through JDBC. > In order to make the transition from existing solutions to the new JDBC > server seamless, the proposal is to use the Hive thrift-server and extend it > as it was done by the STS. > [Here, you can find the design > doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-495) Add basic UI for thriftserver
Marco Gaido created LIVY-495: Summary: Add basic UI for thriftserver Key: LIVY-495 URL: https://issues.apache.org/jira/browse/LIVY-495 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido The issue tracks the implementation of a UI showing basic information about the status of the Livy thriftserver. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-494) Add thriftserver to Livy server
Marco Gaido created LIVY-494: Summary: Add thriftserver to Livy server Key: LIVY-494 URL: https://issues.apache.org/jira/browse/LIVY-494 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido Including the thriftserver in the Livy server. This means starting the Thriftserver at Livy server startup and adding the needed script in order to interact with it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (LIVY-493) Add UTs to the thriftserver module
[ https://issues.apache.org/jira/browse/LIVY-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated LIVY-493: - Description: Tracks the implementation and addition of UT for the new Livy thriftserver. > Add UTs to the thriftserver module > -- > > Key: LIVY-493 > URL: https://issues.apache.org/jira/browse/LIVY-493 > Project: Livy > Issue Type: Sub-task >Reporter: Marco Gaido >Priority: Major > > Tracks the implementation and addition of UT for the new Livy thriftserver. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-493) Add UTs to the thriftserver module
Marco Gaido created LIVY-493: Summary: Add UTs to the thriftserver module Key: LIVY-493 URL: https://issues.apache.org/jira/browse/LIVY-493 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-492) Base implementation Livy thriftserver
Marco Gaido created LIVY-492: Summary: Base implementation Livy thriftserver Key: LIVY-492 URL: https://issues.apache.org/jira/browse/LIVY-492 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido The issue tracks the landing of the initial implementation of the Livy thriftserver. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (LIVY-490) Add thriftserver module
[ https://issues.apache.org/jira/browse/LIVY-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido closed LIVY-490. Resolution: Duplicate > Add thriftserver module > --- > > Key: LIVY-490 > URL: https://issues.apache.org/jira/browse/LIVY-490 > Project: Livy > Issue Type: Sub-task >Reporter: Marco Gaido >Priority: Major > > Add a new module for the Thriftserver implementation -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-491) Add thriftserver module
Marco Gaido created LIVY-491: Summary: Add thriftserver module Key: LIVY-491 URL: https://issues.apache.org/jira/browse/LIVY-491 Project: Livy Issue Type: Sub-task Components: Server Affects Versions: 0.6.0 Reporter: Marco Gaido Add a new module for the implementation of the Livy thriftserver. This includes adding the base thriftserver implementation from Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-490) Add thriftserver module
Marco Gaido created LIVY-490: Summary: Add thriftserver module Key: LIVY-490 URL: https://issues.apache.org/jira/browse/LIVY-490 Project: Livy Issue Type: Sub-task Reporter: Marco Gaido Add a new module for the Thriftserver implementation -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy
[ https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578226#comment-16578226 ] Marco Gaido commented on LIVY-489: -- Sure [~jerryshao], the branch is https://github.com/mgaido91/incubator-livy/tree/livy_thrift, and the diff is https://github.com/apache/incubator-livy/compare/master...mgaido91:livy_thrift. > Expose a JDBC endpoint for Livy > --- > > Key: LIVY-489 > URL: https://issues.apache.org/jira/browse/LIVY-489 > Project: Livy > Issue Type: New Feature > Components: API, Server >Affects Versions: 0.6.0 >Reporter: Marco Gaido >Priority: Major > > Many users and BI tools use JDBC connections in order to retrieve data. As > Livy exposes only a REST API, this is a limitation in its adoption. Hence, > adding a JDBC endpoint may be a very useful feature, which could also make > Livy a more attractive solution for end user to adopt. > Moreover, currently, Spark exposes a JDBC interface, but this has many > limitations, including that all the queries are submitted to the same > application, therefore there is no isolation/security, which can be offered > by Livy, making a Livy JDBC API a better solution for companies/users who > want to use Spark in order to run they queries through JDBC. > In order to make the transition from existing solutions to the new JDBC > server seamless, the proposal is to use the Hive thrift-server and extend it > as it was done by the STS. > [Here, you can find the design > doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25093) CodeFormatter could avoid creating regex object again and again
[ https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578038#comment-16578038 ] Marco Gaido commented on SPARK-25093: - I just marked this as a minor priority ticket, anyway I agree with the proposed improvement. Are you submitting a PR for it? Thanks. > CodeFormatter could avoid creating regex object again and again > --- > > Key: SPARK-25093 > URL: https://issues.apache.org/jira/browse/SPARK-25093 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Minor > > in class `CodeFormatter` > method: `stripExtraNewLinesAndComments` > could be refactored to: > {code:scala} > // Some comments here > val commentReg = > ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +// strip /*comment*/ > """([ |\t]*?\/\/[\s\S]*?\n)""").r // strip //comment > val emptyRowsReg = """\n\s*\n""".r > def stripExtraNewLinesAndComments(input: String): String = { > val codeWithoutComment = commentReg.replaceAllIn(input, "") > emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip ExtraNewLines > } > {code} > so the Regex would be compiled only once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
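The hoisting idea in the Scala snippet above can be sketched in Python for a self-contained illustration (a transliteration, not Spark's actual code; `COMMENT_RE` and friends are names invented here): compile the patterns once at module load, then reuse them on every call instead of recompiling inside the method.

```python
import re

# Python analogue of the proposed refactoring: the two patterns are compiled
# exactly once (at import time) and reused by every call, rather than being
# rebuilt on each invocation of the stripping function.
COMMENT_RE = re.compile(
    r"([ \t]*?/\*[\s\S]*?\*/[ \t]*?)"   # strip /* block comments */
    r"|([ \t]*?//[\s\S]*?\n)"           # strip // line comments
)
EMPTY_ROWS_RE = re.compile(r"\n\s*\n")

def strip_extra_newlines_and_comments(code: str) -> str:
    without_comments = COMMENT_RE.sub("", code)
    # collapse runs of blank lines left behind by the removed comments
    return EMPTY_ROWS_RE.sub("\n", without_comments)

print(strip_extra_newlines_and_comments("int a = 1; // one\n\n/* block */\nint b = 2;\n"))
```

The payoff is the one the ticket describes: regex compilation is paid once per process rather than once per formatted code block.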
[jira] [Updated] (SPARK-25093) CodeFormatter could avoid creating regex object again and again
[ https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25093: Priority: Minor (was: Major) > CodeFormatter could avoid creating regex object again and again > --- > > Key: SPARK-25093 > URL: https://issues.apache.org/jira/browse/SPARK-25093 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Minor > > in class `CodeFormatter` > method: `stripExtraNewLinesAndComments` > could be refactored to: > {code:scala} > // Some comments here > val commentReg = > ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +// strip /*comment*/ > """([ |\t]*?\/\/[\s\S]*?\n)""").r // strip //comment > val emptyRowsReg = """\n\s*\n""".r > def stripExtraNewLinesAndComments(input: String): String = { > val codeWithoutComment = commentReg.replaceAllIn(input, "") > emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip ExtraNewLines > } > {code} > so the Regex would be compiled only once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy
[ https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577995#comment-16577995 ] Marco Gaido commented on LIVY-489: -- Hi [~jerryshao]. Thanks for your comment. Unfortunately I am not sure how to split the implementation, as most of the code is required for it to work. As of now, the only two tasks I have been able to split it into are: - Initial Thriftserver implementation; - Adding a Thriftserver UI. I will keep thinking about this anyway. Any suggestion is welcome. Thanks. > Expose a JDBC endpoint for Livy > --- > > Key: LIVY-489 > URL: https://issues.apache.org/jira/browse/LIVY-489 > Project: Livy > Issue Type: New Feature > Components: API, Server >Affects Versions: 0.6.0 >Reporter: Marco Gaido >Priority: Major > > Many users and BI tools use JDBC connections in order to retrieve data. As > Livy exposes only a REST API, this is a limitation in its adoption. Hence, > adding a JDBC endpoint may be a very useful feature, which could also make > Livy a more attractive solution for end users to adopt. > Moreover, currently, Spark exposes a JDBC interface, but this has many > limitations, including that all the queries are submitted to the same > application, therefore there is no isolation/security, which can be offered > by Livy, making a Livy JDBC API a better solution for companies/users who > want to use Spark in order to run their queries through JDBC. > In order to make the transition from existing solutions to the new JDBC > server seamless, the proposal is to use the Hive thrift-server and extend it > as it was done by the STS. > [Here, you can find the design > doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577593#comment-16577593 ] Marco Gaido commented on SPARK-25094: - This is a duplicate of many other tickets. Unfortunately this problem has not yet been solved, so in this case whole-stage code generation is disabled for the query. There is an ongoing effort to fix this issue in the future, though. > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen > disabled for plan (id=1): > *(1) Project [, ... 10 more fields] > +- *(1) Filter NOT exposure_calc_method#10141 IN > (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) >+- InMemoryTableScan [, ... 11 more fields], [NOT > exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] > +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, > deserialized, 1 replicas) >+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner > :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 > : +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > : +- *(1) Project [, ... 6 more fields] > :+- *(1) Filter (isnotnull(v#49) && > isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) > && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) > : +- InMemoryTableScan [, ... 6 more fields], [, > ... 6 more fields] > : +- InMemoryRelation [, ... 6 more > fields], StorageLevel(memory, deserialized, 1 replicas) > : +- *(1) FileScan csv [,... 6 more > fields] , ...
6 more fields > +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > +- *(3) Project [, ... 74 more fields] >+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 > <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) > +- InMemoryTableScan [, ... 74 more fields], [, > ... 4 more fields] > +- InMemoryRelation [, ... 74 more > fields], StorageLevel(memory, deserialized, 1 replicas) > +- *(1) FileScan csv [,... 74 more > fields] , ... 6 more fields > Compiling "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
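As an aside (context not stated in the thread): when the generated `processNext()` exceeds the JVM's 64KB method limit, Spark logs the warning shown above and falls back to interpreted execution for that stage, so the query still runs, just without whole-stage codegen. The related configuration keys below are a sketch for Spark 2.3/2.4; verify them against your Spark version before relying on them.

```
# spark-defaults.conf fragment (a sketch -- verify these keys for your version)

# Disable whole-stage codegen entirely if the compile-and-fall-back warnings
# are a concern; Spark then always uses the interpreted execution path.
spark.sql.codegen.wholeStage        false

# Internal threshold (bytecode bytes, Spark 2.3+): generated methods larger
# than this make Spark skip whole-stage codegen for the plan up front.
spark.sql.codegen.hugeMethodLimit   65535
```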
[jira] [Updated] (LIVY-489) Expose a JDBC endpoint for Livy
[ https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated LIVY-489: - Description: Many users and BI tools use JDBC connections in order to retrieve data. As Livy exposes only a REST API, this is a limitation in its adoption. Hence, adding a JDBC endpoint may be a very useful feature, which could also make Livy a more attractive solution for end user to adopt. Moreover, currently, Spark exposes a JDBC interface, but this has many limitations, including that all the queries are submitted to the same application, therefore there is no isolation/security, which can be offered by Livy, making a Livy JDBC API a better solution for companies/users who want to use Spark in order to run they queries through JDBC. In order to make the transition from existing solutions to the new JDBC server seamless, the proposal is to use the Hive thrift-server and extend it as it was done by the STS. [Here, you can find the design doc.|https://drive.google.com/file/d/10r8aF1xmL2MTtuREawGcrJobMf5Abtts/view?usp=sharing] was: Many users and BI tools use JDBC connections in order to retrieve data. As Livy exposes only a REST API, this is a limitation in its adoption. Hence, adding a JDBC endpoint may be a very useful feature, which could also make Livy a more attractive solution for end user to adopt. Moreover, currently, Spark exposes a JDBC interface, but this has many limitations, including that all the queries are submitted to the same application, therefore there is no isolation/security, which can be offered by Livy, making a Livy JDBC API a better solution for companies/users who want to use Spark in order to run they queries through JDBC. In order to make the transition from existing solutions to the new JDBC server seamless, the proposal is to use the Hive thrift-server and extend it as it was done by the STS. 
[Here, you can find the design doc.|https://docs.google.com/a/hortonworks.com/document/d/e/2PACX-1vS-ffJwXJ5nZluV-81AJ4WvS3SFX_KcZ0Djz9QGeEtLullYdLHT8dJvuwPpLBT2s3EU4CO6ij14wVcv/pub] > Expose a JDBC endpoint for Livy > --- > > Key: LIVY-489 > URL: https://issues.apache.org/jira/browse/LIVY-489 > Project: Livy > Issue Type: New Feature > Components: API, Server >Affects Versions: 0.6.0 >Reporter: Marco Gaido >Priority: Major > > Many users and BI tools use JDBC connections in order to retrieve data. As > Livy exposes only a REST API, this is a limitation in its adoption. Hence, > adding a JDBC endpoint may be a very useful feature, which could also make > Livy a more attractive solution for end user to adopt. > Moreover, currently, Spark exposes a JDBC interface, but this has many > limitations, including that all the queries are submitted to the same > application, therefore there is no isolation/security, which can be offered > by Livy, making a Livy JDBC API a better solution for companies/users who > want to use Spark in order to run they queries through JDBC. > In order to make the transition from existing solutions to the new JDBC > server seamless, the proposal is to use the Hive thrift-server and extend it > as it was done by the STS. > [Here, you can find the design > doc.|https://drive.google.com/file/d/10r8aF1xmL2MTtuREawGcrJobMf5Abtts/view?usp=sharing] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (LIVY-489) Expose a JDBC endpoint for Livy
Marco Gaido created LIVY-489: Summary: Expose a JDBC endpoint for Livy Key: LIVY-489 URL: https://issues.apache.org/jira/browse/LIVY-489 Project: Livy Issue Type: New Feature Components: API, Server Affects Versions: 0.6.0 Reporter: Marco Gaido Many users and BI tools use JDBC connections in order to retrieve data. As Livy exposes only a REST API, this is a limitation in its adoption. Hence, adding a JDBC endpoint may be a very useful feature, which could also make Livy a more attractive solution for end user to adopt. Moreover, currently, Spark exposes a JDBC interface, but this has many limitations, including that all the queries are submitted to the same application, therefore there is no isolation/security, which can be offered by Livy, making a Livy JDBC API a better solution for companies/users who want to use Spark in order to run they queries through JDBC. In order to make the transition from existing solutions to the new JDBC server seamless, the proposal is to use the Hive thrift-server and extend it as it was done by the STS. [Here, you can find the design doc.|https://docs.google.com/a/hortonworks.com/document/d/e/2PACX-1vS-ffJwXJ5nZluV-81AJ4WvS3SFX_KcZ0Djz9QGeEtLullYdLHT8dJvuwPpLBT2s3EU4CO6ij14wVcv/pub] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25031) The schema of MapType can not be printed correctly
[ https://issues.apache.org/jira/browse/SPARK-25031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573299#comment-16573299 ] Marco Gaido commented on SPARK-25031: - [~smilegator] shall this be resolved as https://github.com/apache/spark/pull/22006 was merged? Thanks. > The schema of MapType can not be printed correctly > -- > > Key: SPARK-25031 > URL: https://issues.apache.org/jira/browse/SPARK-25031 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Hao Ren >Priority: Minor > Labels: easyfix > > Something wrong with the function `buildFormattedString` in `MapType` > > {code:java} > import spark.implicits._ > case class Key(a: Int) > case class Value(b: Int) > Seq( > (1, Map(Key(1) -> Value(2))), > (2, Map(Key(1) -> Value(2))) > ).toDF("id", "dict").printSchema > {code} > The result is: > {code:java} > root > |-- id: integer (nullable = false) > |-- dict: map (nullable = true) > | |-- key: struct > | |-- value: struct (valueContainsNull = true) > | | |-- a: integer (nullable = false) > | | |-- b: integer (nullable = false) > {code} > The expected is > {code:java} > root > |-- id: integer (nullable = false) > |-- dict: map (nullable = true) > | |-- key: struct > | | |-- a: integer (nullable = false) > | |-- value: struct (valueContainsNull = true) > | | |-- b: integer (nullable = false) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25042) Flaky test: org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic
Marco Gaido created SPARK-25042: --- Summary: Flaky test: org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic Key: SPARK-25042 URL: https://issues.apache.org/jira/browse/SPARK-25042 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.4.0 Reporter: Marco Gaido The test {{compacted topic}} in {{org.apache.spark.streaming.kafka010.KafkaRDDSuite}} is flaky: it failed in an unrelated PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94293/testReport/. And it passes locally on the same branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289 ] Marco Gaido edited comment on SPARK-24928 at 8/6/18 4:13 PM: - [~matthewnormyle] the fix you are proposing doesn't solve the problem, but it returns a wrong result. The root cause of the issue here is that the for is a nested loop. So if the outer iterator is the small one, we build much less iterators than otherwise. I think that in the RDD case there is few we can do, while for the SQL case we can probably add an optimizer rule using the statistics (if they are available). PS I will submit soon a PR with the Optimizer rule to use the best side to build the nested loop if we have the stats. I don't think we can do anything else. Thanks. was (Author: mgaido): [~matthewnormyle] the fix you are proposing doesn't solve the problem, but it returns a wrong result. The root cause of the issue here is that the for is a nested loop. So if the outer iterator is the small one, we build much less iterators than otherwise. I think that in the RDD case there is few we can do, while for the SQL case we can probably add an optimizer rule using the statistics (if they are available). 
> spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289 ] Marco Gaido edited comment on SPARK-24928 at 8/6/18 2:45 PM: - [~matthewnormyle] the fix you are proposing doesn't solve the problem, but it returns a wrong result. The root cause of the issue here is that the for is a nested loop. So if the outer iterator is the small one, we build much less iterators than otherwise. I think that in the RDD case there is few we can do, while for the SQL case we can probably add an optimizer rule using the statistics (if they are available). was (Author: mgaido): [~matthewnormyle] the fix you are proposing doesn't solve the problem, but it returns a wrong result. The root cause of the issue here is that the for is a nested loop. So if the outer iterator is the small one, we build much less iterators than otherwise. I think that in the RDD case there is few we can do, while for the SQL case we can probably add an optimizer rule using the statistics (if they are computed). 
> spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289 ] Marco Gaido commented on SPARK-24928: - [~matthewnormyle] the fix you are proposing doesn't solve the problem; instead, it returns a wrong result. The root cause of the issue here is that the join is implemented as a nested loop, so if the outer iterator is the small one, we build far fewer inner iterators than otherwise. I think that in the RDD case there is little we can do, while for the SQL case we can probably add an optimizer rule using the statistics (if they are computed). > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25012) dataframe creation results in matcherror
[ https://issues.apache.org/jira/browse/SPARK-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570022#comment-16570022 ] Marco Gaido commented on SPARK-25012: - [~simm] you're right that the error message doesn't help and indeed it was fixed in SPARK-24366. So if you try in current master branch (or in the upcoming 2.4 release when it will be out), you should get a more meaningful error message which may help you in your debugging. I am not sure about the root cause of the "random" behavior of your test cases, but I think it is caused by some misuse in your code. > dataframe creation results in matcherror > > > Key: SPARK-25012 > URL: https://issues.apache.org/jira/browse/SPARK-25012 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.1 > Environment: spark 2.3.1 > mac > scala 2.11.12 > >Reporter: uwe >Priority: Major > > hi, > > running the attached code results in a > > {code:java} > scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) > {code} > # i do think this is wrong (at least i do not see the issue in my code) > # the error is the ein 90% of the cases (it sometimes passes). 
that makes me > think something weird is going on > > > {code:java} > package misc > import java.sql.Timestamp > import java.time.LocalDateTime > import java.time.format.DateTimeFormatter > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.sources._ > import org.apache.spark.sql.types.{StringType, StructField, StructType, > TimestampType} > import org.apache.spark.sql.{Row, SQLContext, SparkSession} > case class LogRecord(application:String, dateTime: Timestamp, component: > String, level: String, message: String) > class LogRelation(val sqlContext: SQLContext, val path: String) extends > BaseRelation with PrunedFilteredScan { > override def schema: StructType = StructType(Seq( > StructField("application", StringType, false), > StructField("dateTime", TimestampType, false), > StructField("component", StringType, false), > StructField("level", StringType, false), > StructField("message", StringType, false))) > override def buildScan(requiredColumns: Array[String], filters: > Array[Filter]): RDD[Row] = { > val str = "2017-02-09T00:09:27" > val ts =Timestamp.valueOf(LocalDateTime.parse(str, > DateTimeFormatter.ISO_LOCAL_DATE_TIME)) > val > data=List(Row("app",ts,"comp","level","mess"),Row("app",ts,"comp","level","mess")) > sqlContext.sparkContext.parallelize(data) > } > } > class LogDataSource extends DataSourceRegister with RelationProvider { > override def shortName(): String = "log" > override def createRelation(sqlContext: SQLContext, parameters: Map[String, > String]): BaseRelation = > new LogRelation(sqlContext, parameters("path")) > } > object f0 extends App { > lazy val spark: SparkSession = > SparkSession.builder().master("local").appName("spark session").getOrCreate() > val df = spark.read.format("log").load("hdfs:///logs") > df.show() > } > > {code} > > results in the following stacktrace > > {noformat} > 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - > Task 0 in stage 0.0 failed 1 times; aborting job > Exception in 
thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): > scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379) > at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60) > at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) > at >
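A plausible reading of the failure (an assumption on my part; the column names and pruning order below are illustrative): the `buildScan` above ignores its `requiredColumns` argument and always returns five-column rows, so whenever Spark prunes or reorders columns, the per-position Catalyst converters no longer line up with the emitted values, and a `StringType` converter can receive the `java.sql.Timestamp` — exactly the `scala.MatchError` in the stack trace. Because the mismatch is positional and depends on which columns get requested, it would also explain the intermittent ("90% of the cases") behaviour. A toy reconstruction of the mismatch:

```python
from datetime import datetime

# Declared schema order of the LogRelation in the ticket.
FULL_SCHEMA = ["application", "dateTime", "component", "level", "message"]
row = ("app", datetime(2017, 2, 9, 0, 9, 27), "comp", "level", "mess")

def to_catalyst_string(value):
    """Mimics CatalystTypeConverters' StringConverter: anything that is
    not a string falls through the match, i.e. raises (scala.MatchError)."""
    if isinstance(value, str):
        return value
    raise TypeError(f"scala.MatchError: {value!r}")

# Suppose Spark pruned the scan down to two string columns...
required_columns = ["component", "level"]
converters = [to_catalyst_string, to_catalyst_string]

# ...but a buildScan that ignores requiredColumns emits full-width rows.
# Converters are applied positionally, so the string converter at index 1
# meets the Timestamp sitting at index 1 of the full row:
error = None
try:
    converted = [conv(val) for conv, val in zip(converters, row)]
except TypeError as exc:
    error = str(exc)
```

A `buildScan` that projects its rows according to `required_columns` before returning them would keep the positions aligned.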
[jira] [Commented] (SPARK-25012) dataframe creation results in matcherror
[ https://issues.apache.org/jira/browse/SPARK-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569960#comment-16569960 ] Marco Gaido commented on SPARK-25012: - Seems the same as SPARK-24366. Seems anyway a problem in you schema definition/column mappings. > dataframe creation results in matcherror > > > Key: SPARK-25012 > URL: https://issues.apache.org/jira/browse/SPARK-25012 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.1 > Environment: spark 2.3.1 > mac > scala 2.11.12 > >Reporter: uwe >Priority: Major > > hi, > > running the attached code results in a > > {code:java} > scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) > {code} > # i do think this is wrong (at least i do not see the issue in my code) > # the error is the ein 90% of the cases (it sometimes passes). that makes me > think something weird is going on > > > {code:java} > package misc > import java.sql.Timestamp > import java.time.LocalDateTime > import java.time.format.DateTimeFormatter > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.sources._ > import org.apache.spark.sql.types.{StringType, StructField, StructType, > TimestampType} > import org.apache.spark.sql.{Row, SQLContext, SparkSession} > case class LogRecord(application:String, dateTime: Timestamp, component: > String, level: String, message: String) > class LogRelation(val sqlContext: SQLContext, val path: String) extends > BaseRelation with PrunedFilteredScan { > override def schema: StructType = StructType(Seq( > StructField("application", StringType, false), > StructField("dateTime", TimestampType, false), > StructField("component", StringType, false), > StructField("level", StringType, false), > StructField("message", StringType, false))) > override def buildScan(requiredColumns: Array[String], filters: > Array[Filter]): RDD[Row] = { > val str = "2017-02-09T00:09:27" > val ts =Timestamp.valueOf(LocalDateTime.parse(str, > 
DateTimeFormatter.ISO_LOCAL_DATE_TIME)) > val > data=List(Row("app",ts,"comp","level","mess"),Row("app",ts,"comp","level","mess")) > sqlContext.sparkContext.parallelize(data) > } > } > class LogDataSource extends DataSourceRegister with RelationProvider { > override def shortName(): String = "log" > override def createRelation(sqlContext: SQLContext, parameters: Map[String, > String]): BaseRelation = > new LogRelation(sqlContext, parameters("path")) > } > object f0 extends App { > lazy val spark: SparkSession = > SparkSession.builder().master("local").appName("spark session").getOrCreate() > val df = spark.read.format("log").load("hdfs:///logs") > df.show() > } > > {code} > > results in the following stacktrace > > {noformat} > 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - > Task 0 in stage 0.0 failed 1 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): > scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379) > at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60) > at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at
[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568152#comment-16568152 ] Marco Gaido commented on SPARK-23937: - I am working on this, thanks. > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
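The Presto-style examples above can be modelled directly; here is a plain-Python sketch of the intended `map_filter` semantics (keep the entries for which the predicate over (key, value) returns true):

```python
def map_filter(m, f):
    """Model of the higher-order function map_filter(map, (k, v) -> bool):
    constructs a new map from the entries for which f(k, v) is truthy."""
    return {k: v for k, v in m.items() if f(k, v)}

# The three examples from the ticket, transliterated:
empty = map_filter({}, lambda k, v: True)                 # {}
non_null = map_filter({10: "a", 20: None, 30: "c"},
                      lambda k, v: v is not None)         # {10: "a", 30: "c"}
big_values = map_filter({"k1": 20, "k2": 3, "k3": 15},
                        lambda k, v: v > 10)              # {"k1": 20, "k3": 15}
```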
[jira] [Commented] (SPARK-24598) SPARK SQL:Datatype overflow conditions gives incorrect result
[ https://issues.apache.org/jira/browse/SPARK-24598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568005#comment-16568005 ] Marco Gaido commented on SPARK-24598: - [~smilegator] as we just enhanced the doc, but we have not really addressed the overflow condition, which I think we are targeting for a fix for 3.0, shall we leave this open for now and resolve it once the actual fix is in place? What do you think? Thanks. > SPARK SQL:Datatype overflow conditions gives incorrect result > - > > Key: SPARK-24598 > URL: https://issues.apache.org/jira/browse/SPARK-24598 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: navya >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Execute an sql query, so that it results in overflow conditions. > EX - SELECT 9223372036854775807 + 1 result = -9223372036854776000 > > Expected result - Error should be throw like mysql. > mysql> SELECT 9223372036854775807 + 1; > ERROR 1690 (22003): BIGINT value is out of range in '(9223372036854775807 + > 1)' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
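The wrap-around under discussion is ordinary two's-complement long arithmetic (the value printed in the report, -9223372036854776000, looks like a display rounding of the actual wrapped result, -9223372036854775808). A small model of the unchecked behaviour versus the checked, MySQL-style behaviour the reporter expects:

```python
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

def add_wrapping(a, b):
    """Unchecked 64-bit addition: wraps around like Java's long `+`."""
    return (a + b + (1 << 63)) % (1 << 64) - (1 << 63)

def add_checked(a, b):
    """Checked addition: raise on overflow, as MySQL does for BIGINT."""
    result = a + b
    if not (INT64_MIN <= result <= INT64_MAX):
        raise OverflowError(f"BIGINT value is out of range in '({a} + {b})'")
    return result

add_wrapping(9223372036854775807, 1)  # wraps silently to -9223372036854775808
```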
[jira] [Commented] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404
[ https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563259#comment-16563259 ] Marco Gaido commented on SPARK-24975: - This seems a duplicate of SPARK-24188. Despite here I see that 2.3.1 is affected, while this should not be the case according to SPARK-24188. May you please check if 2.3.1 is actually affected and if not close this as duplicate? Thanks. > Spark history server REST API /api/v1/version returns error 404 > --- > > Key: SPARK-24975 > URL: https://issues.apache.org/jira/browse/SPARK-24975 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: shanyu zhao >Priority: Major > > Spark history server REST API provides /api/v1/version, according to doc: > [https://spark.apache.org/docs/latest/monitoring.html] > However, for Spark 2.3, we see: > {code:java} > curl http://localhost:18080/api/v1/version > > > > Error 404 Not Found > > HTTP ERROR 404 > Problem accessing /api/v1/version. Reason: > Not Foundhttp://eclipse.org/jetty;>Powered by > Jetty:// 9.3.z-SNAPSHOT > > {code} > On a Spark 2.2 cluster, we see: > {code:java} > curl http://localhost:18080/api/v1/version > { > "spark" : "2.2.0" > }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24944) SparkUi build problem
[ https://issues.apache.org/jira/browse/SPARK-24944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561587#comment-16561587 ] Marco Gaido commented on SPARK-24944: - Can you close this JIRA as invalid? Thanks. > SparkUi build problem > - > > Key: SPARK-24944 > URL: https://issues.apache.org/jira/browse/SPARK-24944 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0, 2.3.1 > Environment: scala 2.11.8 > java version "1.8.0_181" > Java(TM) SE Runtime Environment (build 1.8.0_181-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) > > Gradle 4.5.1 > > Build time: 2018-02-05 13:22:49 UTC > Revision: 37007e1c012001ff09973e0bd095139239ecd3b3 > Groovy: 2.4.12 > Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017 > JVM: 1.8.0_181 (Oracle Corporation 25.181-b13) > OS: Windows 7 6.1 amd64 > > build.gradle: > group 'it.build-test.spark' > version '1.0-SNAPSHOT' > apply plugin: 'java' > apply plugin: 'scala' > sourceCompatibility = 1.8 > repositories { > mavenCentral() > } > dependencies { > compile 'org.apache.spark:spark-core_2.11:2.3.1' > compile 'org.scala-lang:scala-library:2.11.8' > } > tasks.withType(ScalaCompile) { > scalaCompileOptions.additionalParameters = ["-Ylog-classpath"] > } >Reporter: Fabio >Priority: Minor > Labels: UI, WebUI, build > Attachments: build-test.zip > > > Hi. I'm trying to customize SparkUi with my business logic. Trying to access > to ui, I have ta build problem. It's enough to create this class: > _package org.apache.spark_ > _import org.apache.spark.ui.SparkUI_ > _case class SparkContextUtils(sc: SparkContext) {_ > _def ui: Option[SparkUI] = sc.ui_ > _}_ > > to have this error: > > _missing or invalid dependency detected while loading class file > 'WebUI.class'._ > _Could not access term eclipse in package org,_ > _because it (or its dependencies) are missing. Check your build definition > for_ > _missing or conflicting dependencies. 
(Re-run with `-Ylog-classpath` to see > the problematic classpath.)_ > _A full rebuild may help if 'WebUI.class' was compiled against an > incompatible version of org._ > _missing or invalid dependency detected while loading class file > 'WebUI.class'._ > _Could not access term jetty in value org.eclipse,_ > _because it (or its dependencies) are missing. Check your build definition > for_ > _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see > the problematic classpath.)_ > _A full rebuild may help if 'WebUI.class' was compiled against an > incompatible version of org.eclipse._ > _two errors found_ > _:compileScala FAILED_ > _FAILURE: Build failed with an exception._ > _* What went wrong:_ > _Execution failed for task ':compileScala'._ > _> Compilation failed_ > _* Try:_ > _Run with --stacktrace option to get the stack trace. Run with --info or > --debug option to get more log output. Run with --scan to get full insights._ > _* Get more help at https://help.gradle.org_ > _BUILD FAILED in 26s_ > _1 actionable task: 1 executed_ > _Compilation failed_ > > The option "-Ylog-classpath" hasn't any useful information > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
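A likely explanation (an assumption on my part, not confirmed in the ticket): spark-core declares the Eclipse Jetty artifacts with provided scope, so a Gradle project compiling code that touches WebUI's signatures never gets those classes on its compile classpath. If so, one workaround is to add the Jetty dependency explicitly; the coordinates and version below are illustrative and should be matched to the Spark release in use:

```
dependencies {
    compile 'org.apache.spark:spark-core_2.11:2.3.1'
    compile 'org.scala-lang:scala-library:2.11.8'
    // Hypothetical workaround: make the Jetty types referenced by
    // org.apache.spark.ui.WebUI visible at compile time.
    compileOnly 'org.eclipse.jetty:jetty-servlet:9.3.20.v20170531'
}
```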
[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen
[ https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561077#comment-16561077 ] Marco Gaido commented on SPARK-24957: - I am not sure what you mean by "When codegen is disabled all results are correct.". I checked and I was able to reproduce both with codegen enabled and with codegen disabled. cc [~jerryshao] this doesn't seem a regression to me but it is a pretty serious bug, I am not sure whether we should include it in the next 2.3 version. cc [~smilegator] [~cloud_fan] I think we should consider this a blocker for 2.4. What do you think? Thanks. > Decimal arithmetic can lead to wrong values using codegen > - > > Key: SPARK-24957 > URL: https://issues.apache.org/jira/browse/SPARK-24957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > > I noticed a bug when doing arithmetic on a dataframe containing decimal > values with codegen enabled. > I tried to narrow it down on a small repro and got this (executed in > spark-shell): > {noformat} > scala> val df = Seq( > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("11.88")) > | ).toDF("text", "number") > df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)] > scala> val df_grouped_1 = > df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number")) > df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_1.collect() > res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143]) > scala> val df_grouped_2 = > df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> 
df_grouped_2.collect() > res1: Array[org.apache.spark.sql.Row] = > Array([a,11948571.4285714285714285714286]) > scala> val df_total_sum = > df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)] > scala> df_total_sum.collect() > res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143]) > {noformat} > The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the > result of {{df_grouped_2}} is clearly incorrect (it is the value of the > correct result times {{10^14}}). > When codegen is disabled all results are correct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24948) SHS filters wrongly some applications due to permission check
Marco Gaido created SPARK-24948: --- Summary: SHS filters wrongly some applications due to permission check Key: SPARK-24948 URL: https://issues.apache.org/jira/browse/SPARK-24948 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: Marco Gaido SHS filters out the event logs it doesn't have permission to read. Unfortunately, this check is quite naive, as it takes into account only the base permissions (ie. user, group, other permissions). For instance, if ACLs are enabled, they are ignored in this check; moreover, each filesystem may have different policies (eg. it can consider spark a superuser who can access everything). This results in some applications not being displayed in the SHS, even though the Spark user (or whatever user the SHS is started with) can actually read their event logs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
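The naivety described here can be illustrated in miniature: deciding readability from the user/group/other bits alone diverges from asking the filesystem itself, which is the only party that knows about ACLs or superuser-style policies. A sketch over plain POSIX permissions, not the actual SHS/HDFS code:

```python
import os
import stat

def can_read_from_basic_bits(path, uid, gid):
    """Naive check in the style the ticket criticizes: look only at the
    owner/group/other permission bits of the file."""
    st = os.stat(path)
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid == gid:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)

def can_read_per_filesystem(path):
    """Ask the filesystem to evaluate access instead: this honours ACLs
    and any filesystem-specific policy the basic bits cannot express."""
    return os.access(path, os.R_OK)
```

With an ACL granting the history-server user read access on a mode-0640 file owned by someone else, the first check returns False while the second returns True — precisely the class of event logs the SHS was hiding.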
[jira] [Commented] (SPARK-24944) SparkUi build problem
[ https://issues.apache.org/jira/browse/SPARK-24944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559790#comment-16559790 ] Marco Gaido commented on SPARK-24944: - This seems more a problem in your project and your dependencies than an issue in Spark. This - rather than a JIRA - should have been a question sent to the mailing list. > SparkUi build problem > - > > Key: SPARK-24944 > URL: https://issues.apache.org/jira/browse/SPARK-24944 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0, 2.3.1 > Environment: scala 2.11.8 > java version "1.8.0_181" > Java(TM) SE Runtime Environment (build 1.8.0_181-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) > > Gradle 4.5.1 > > Build time: 2018-02-05 13:22:49 UTC > Revision: 37007e1c012001ff09973e0bd095139239ecd3b3 > Groovy: 2.4.12 > Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017 > JVM: 1.8.0_181 (Oracle Corporation 25.181-b13) > OS: Windows 7 6.1 amd64 > > build.gradle: > group 'it.build-test.spark' > version '1.0-SNAPSHOT' > apply plugin: 'java' > apply plugin: 'scala' > sourceCompatibility = 1.8 > repositories { > mavenCentral() > } > dependencies { > compile 'org.apache.spark:spark-core_2.11:2.3.1' > compile 'org.scala-lang:scala-library:2.11.8' > } > tasks.withType(ScalaCompile) { > scalaCompileOptions.additionalParameters = ["-Ylog-classpath"] > } >Reporter: Fabio >Priority: Major > Labels: UI, WebUI, build > Attachments: build-test.zip > > > Hi. I'm trying to customize SparkUi with my business logic. Trying to access > to ui, I have ta build problem. 
It's enough to create this class: > _package org.apache.spark_ > _import org.apache.spark.ui.SparkUI_ > _case class SparkContextUtils(sc: SparkContext) {_ > _def ui: Option[SparkUI] = sc.ui_ > _}_ > > to have this error: > > _missing or invalid dependency detected while loading class file > 'WebUI.class'._ > _Could not access term eclipse in package org,_ > _because it (or its dependencies) are missing. Check your build definition > for_ > _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see > the problematic classpath.)_ > _A full rebuild may help if 'WebUI.class' was compiled against an > incompatible version of org._ > _missing or invalid dependency detected while loading class file > 'WebUI.class'._ > _Could not access term jetty in value org.eclipse,_ > _because it (or its dependencies) are missing. Check your build definition > for_ > _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see > the problematic classpath.)_ > _A full rebuild may help if 'WebUI.class' was compiled against an > incompatible version of org.eclipse._ > _two errors found_ > _:compileScala FAILED_ > _FAILURE: Build failed with an exception._ > _* What went wrong:_ > _Execution failed for task ':compileScala'._ > _> Compilation failed_ > _* Try:_ > _Run with --stacktrace option to get the stack trace. Run with --info or > --debug option to get more log output. Run with --scan to get full insights._ > _* Get more help at https://help.gradle.org_ > _BUILD FAILED in 26s_ > _1 actionable task: 1 executed_ > _Compilation failed_ > > The option "-Ylog-classpath" hasn't any useful information > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558287#comment-16558287 ] Marco Gaido commented on SPARK-24928: - The affected version is pretty old, can you check a newer version? > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data
[ https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555652#comment-16555652 ] Marco Gaido edited comment on SPARK-24904 at 7/25/18 1:28 PM: -- I see now what you mean, but yes, I think there is an assumption you are doing which is not always true, ie. "The output is (expected to be) very small compared to the big table". That is not true. If all the rows from the big table match the small one, this is not the case. We may trying to do something like what you mentioned in the optimizer if CBO is enabled and we have good enough statistics about the output size of the inner join, but i am not sure. was (Author: mgaido): I see now what you mean, but yes, It think there is an assumption you are doing which is not always true, ie. "The output is (expected to be) very small compared to the big table". That is not true. If all the rows from the big table match the small one, this is not the case. We may trying to do something like what you mentioned in the optimizer if CBO is enabled and we have good enough statistics about the output size of the inner join, but i am not sure. > Join with broadcasted dataframe causes shuffle of redundant data > > > Key: SPARK-24904 > URL: https://issues.apache.org/jira/browse/SPARK-24904 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.2 >Reporter: Shay Elbaz >Priority: Minor > > When joining a "large" dataframe with broadcasted small one, and join-type is > on the small DF side (see right-join below), the physical plan falls back to > sort merge join. But when the join is on the large DF side, the broadcast > does take place. Is there a good reason for this? 
In the below example it > sure doesn't make any sense to shuffle the entire large table: > > {code:java} > val small = spark.range(1, 10) > val big = spark.range(1, 1 << 30) > .withColumnRenamed("id", "id2") > big.join(broadcast(small), $"id" === $"id2", "right") > .explain > //OUTPUT: > == Physical Plan == > SortMergeJoin [id2#16307L], [id#16310L], RightOuter > :- *Sort [id2#16307L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id2#16307L, 1000) > : +- *Project [id#16304L AS id2#16307L] > : +- *Range (1, 1073741824, step=1, splits=Some(600)) > +- *Sort [id#16310L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16310L, 1000) > +- *Range (1, 10, step=1, splits=Some(600)) > {code} > As a workaround, users need to perform inner instead of right join, and then > join the result back with the small DF to fill the missing rows. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data
[ https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555842#comment-16555842 ] Marco Gaido commented on SPARK-24904: - [~shay_elbaz] In the case I mentioned before the approach you proposed is not better, it is worse, as it requires an unneeded additional broadcast join. > Join with broadcasted dataframe causes shuffle of redundant data > > > Key: SPARK-24904 > URL: https://issues.apache.org/jira/browse/SPARK-24904 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.2 >Reporter: Shay Elbaz >Priority: Minor > > When joining a "large" dataframe with broadcasted small one, and join-type is > on the small DF side (see right-join below), the physical plan falls back to > sort merge join. But when the join is on the large DF side, the broadcast > does take place. Is there a good reason for this? In the below example it > sure doesn't make any sense to shuffle the entire large table: > > {code:java} > val small = spark.range(1, 10) > val big = spark.range(1, 1 << 30) > .withColumnRenamed("id", "id2") > big.join(broadcast(small), $"id" === $"id2", "right") > .explain > //OUTPUT: > == Physical Plan == > SortMergeJoin [id2#16307L], [id#16310L], RightOuter > :- *Sort [id2#16307L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id2#16307L, 1000) > : +- *Project [id#16304L AS id2#16307L] > : +- *Range (1, 1073741824, step=1, splits=Some(600)) > +- *Sort [id#16310L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16310L, 1000) > +- *Range (1, 10, step=1, splits=Some(600)) > {code} > As a workaround, users need to perform inner instead of right join, and then > join the result back with the small DF to fill the missing rows. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data
[ https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555652 ] Marco Gaido commented on SPARK-24904: - I see now what you mean, but yes, I think there is an assumption you are making which is not always true, ie. "The output is (expected to be) very small compared to the big table". If all the rows from the big table match the small one, that is not the case. We may try to do something like what you mentioned in the optimizer if CBO is enabled and we have good enough statistics about the output size of the inner join, but I am not sure. > Join with broadcasted dataframe causes shuffle of redundant data > > > Key: SPARK-24904 > URL: https://issues.apache.org/jira/browse/SPARK-24904 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.2 >Reporter: Shay Elbaz >Priority: Minor > > When joining a "large" dataframe with broadcasted small one, and join-type is > on the small DF side (see right-join below), the physical plan falls back to > sort merge join. But when the join is on the large DF side, the broadcast > does take place. Is there a good reason for this? 
In the below example it > sure doesn't make any sense to shuffle the entire large table: > > {code:java} > val small = spark.range(1, 10) > val big = spark.range(1, 1 << 30) > .withColumnRenamed("id", "id2") > big.join(broadcast(small), $"id" === $"id2", "right") > .explain > //OUTPUT: > == Physical Plan == > SortMergeJoin [id2#16307L], [id#16310L], RightOuter > :- *Sort [id2#16307L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id2#16307L, 1000) > : +- *Project [id#16304L AS id2#16307L] > : +- *Range (1, 1073741824, step=1, splits=Some(600)) > +- *Sort [id#16310L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16310L, 1000) > +- *Range (1, 10, step=1, splits=Some(600)) > {code} > As a workaround, users need to perform inner instead of right join, and then > join the result back with the small DF to fill the missing rows. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
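The workaround mentioned in the report (perform an inner join instead of a right join, then join the result back with the small DF to fill the missing rows) can be sketched in plain-Python terms. This is illustrative relational logic only, not Spark code; all names are made up for the example:

```python
# Emulate big.join(small, "right") as: inner join, then fill back the
# small-table rows that found no partner (null-padded on the big side).

def inner_join(big_ids, small_ids):
    # rows of the big table that have a match in the small table
    small = set(small_ids)
    return [(i, i) for i in big_ids if i in small]

def right_join_via_inner(big_ids, small_ids):
    joined = inner_join(big_ids, small_ids)
    matched = {s for (_, s) in joined}
    # restore the unmatched small-table rows, null-padded on the left
    return joined + [(None, s) for s in small_ids if s not in matched]

big = range(1, 100)    # stand-in for the "large" dataframe
small = [2, 3, 500]    # 500 has no match in big
result = right_join_via_inner(big, small)
```

In Spark, only the first step (the inner join) shuffles nothing when the small side is broadcast, which is exactly why the reporter proposes this two-step plan over a full sort-merge join.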
[jira] [Commented] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data
[ https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555477#comment-16555477 ] Marco Gaido commented on SPARK-24904: - You cannot do a broadcast join when it is on the side of the small table, as the join requires comparing each row of the small table with the whole big table and outputting it in the result even if no match is found. Since the big table is available only in small pieces in each task, no task can determine whether the row matched at least once (as it doesn't know what other tasks did). > Join with broadcasted dataframe causes shuffle of redundant data > > > Key: SPARK-24904 > URL: https://issues.apache.org/jira/browse/SPARK-24904 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.1.2 >Reporter: Shay Elbaz >Priority: Minor > > When joining a "large" dataframe with broadcasted small one, and join-type is > on the small DF side (see right-join below), the physical plan does not > include broadcasting the small table. But when the join is on the large DF > side, the broadcast does take place. Is there a good reason for this? 
In the > below example it sure doesn't make any sense to shuffle the entire large > table: > > {code:java} > val small = spark.range(1, 10) > val big = spark.range(1, 1 << 30) > .withColumnRenamed("id", "id2") > big.join(broadcast(small), $"id" === $"id2", "right") > .explain > //OUTPUT: > == Physical Plan == > SortMergeJoin [id2#16307L], [id#16310L], RightOuter > :- *Sort [id2#16307L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id2#16307L, 1000) > : +- *Project [id#16304L AS id2#16307L] > : +- *Range (1, 1073741824, step=1, splits=Some(600)) > +- *Sort [id#16310L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16310L, 1000) > +- *Range (1, 10, step=1, splits=Some(600)) > {code} > As a workaround, users need to perform inner instead of right join, and then > join the result back with the small DF to fill the missing rows. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
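The argument in the comment above — no task holding only a slice of the big table can know whether a small-table row matched in some other task's slice — can be illustrated with a plain-Python simulation. This is illustrative only, not Spark internals:

```python
# If each task tried to null-pad the broadcast small table against only
# its own slice of the big table, unmatched rows would be fabricated.

def local_right_join(big_slice, small_ids):
    """What one task could do with only its slice of the big table."""
    big = set(big_slice)
    small = set(small_ids)
    matches = [(b, b) for b in big_slice if b in small]
    # null-pad locally: WRONG, the row may match in another task's slice
    unmatched = [(None, s) for s in small_ids if s not in big]
    return matches + unmatched

small = [1, 2]
partitions = [[1, 10, 11], [12, 13, 14]]  # big table split across two tasks

# each task pads locally and the results are concatenated
naive = [row for part in partitions for row in local_right_join(part, small)]

# correct global right outer join, for comparison
big_all = {b for part in partitions for b in part}
correct = ([(b, b) for part in partitions for b in part if b in set(small)]
           + [(None, s) for s in small if s not in big_all])
```

The naive per-task result contains `(None, 1)` even though row 1 matched in the first slice, and emits the genuinely unmatched row 2 once per task — which is why Spark falls back to shuffling both sides for this join type.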
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543336#comment-16543336 ] Marco Gaido commented on SPARK-24498: - [~maropu] yes, I remember I had some trouble compiling the generated code with the JDK compiler too. There is also one case (which I saw you addressed in the branch you prepared, by generating the proper code according to the chosen compiler) in which there isn't really a way to make both of them happy. In other cases, when there is a form which works fine on both, I think it would be great to use it. So I agree with your proposal. My only concern is that as of now we have no way to check the compilation with the JDK, so it would probably be hard to enforce that we correct all the problems and/or don't introduce new ones. So the risk is that the effort spent on that task might not be so useful... > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24782) Simplify conf access in expressions
Marco Gaido created SPARK-24782: --- Summary: Simplify conf access in expressions Key: SPARK-24782 URL: https://issues.apache.org/jira/browse/SPARK-24782 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Marco Gaido Previously, we were not able to access configs on the executor side. This led to some workarounds for getting the right configuration on the driver and sending it to the executors when dealing with SQL expressions. As these workarounds are not needed anymore, we can remove them and simplify the way SQLConf is accessed by them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24268) DataType in error messages are not coherent
[ https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24268: Description: In SPARK-22893 there was a tentative to unify the way dataTypes are reported in error messages. There, we decided to use always {{dataType.simpleString}}. Unfortunately, we missed many places where this still needed to be fixed. Moreover, it turns out that the right method to use is not {{simpleString}}, but we should use {{catalogString}} instead (for further details please check the discussion in the PR https://github.com/apache/spark/pull/21321). So we should update all the missing places in order to provide error messages coherently throughout the project. was: In SPARK-22893 there was a tentative to unify the way dataTypes are reported in error messages. There, we decided to use always {{dataType.simpleString}}. Unfortunately, we missed many places where this still needed to be fixed. Moreover, it turns out that the right method to use is not {{simpleString}}, but we should use {{catalogString}} instead (for further details please check the discussion in the PR ). So we should update all the missing places in order to provide error messages coherently throughout the project. > DataType in error messages are not coherent > --- > > Key: SPARK-24268 > URL: https://issues.apache.org/jira/browse/SPARK-24268 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > > In SPARK-22893 there was a tentative to unify the way dataTypes are reported > in error messages. There, we decided to use always {{dataType.simpleString}}. > Unfortunately, we missed many places where this still needed to be fixed. 
> Moreover, it turns out that the right method to use is not {{simpleString}}, > but we should use {{catalogString}} instead (for further details please check > the discussion in the PR https://github.com/apache/spark/pull/21321). > So we should update all the missing places in order to provide error messages > coherently throughout the project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24268) DataType in error messages are not coherent
[ https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24268: Description: In SPARK-22893 there was a tentative to unify the way dataTypes are reported in error messages. There, we decided to use always {{dataType.simpleString}}. Unfortunately, we missed many places where this still needed to be fixed. Moreover, it turns out that the right method to use is not {{simpleString}}, but we should use {{catalogString}} instead (for further details please check the discussion in the PR ). So we should update all the missing places in order to provide error messages coherently throughout the project. was: In SPARK-22893 there was a tentative to unify the way dataTypes are reported in error messages. There, we decided to use always {{dataType.simpleString}}. Unfortunately, we missed many places where this still needed to be fixed. Moreover, it turns out that the right method to use is not {{simpleString}}, but we should use {{catalogString}} instead. So we should update all the missing places in order to provide error messages coherently throughout the project. > DataType in error messages are not coherent > --- > > Key: SPARK-24268 > URL: https://issues.apache.org/jira/browse/SPARK-24268 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > > In SPARK-22893 there was a tentative to unify the way dataTypes are reported > in error messages. There, we decided to use always {{dataType.simpleString}}. > Unfortunately, we missed many places where this still needed to be fixed. > Moreover, it turns out that the right method to use is not {{simpleString}}, > but we should use {{catalogString}} instead (for further details please check > the discussion in the PR ). > So we should update all the missing places in order to provide error messages > coherently throughout the project. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24268) DataType in error messages are not coherent
[ https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24268: Description: In SPARK-22893 there was a tentative to unify the way dataTypes are reported in error messages. There, we decided to use always {{dataType.simpleString}}. Unfortunately, we missed many places where this still needed to be fixed. Moreover, it turns out that the right method to use is not {{simpleString}}, but we should use {{catalogString}} instead. So we should update all the missing places in order to provide error messages coherently throughout the project. was: In SPARK-22893 there was a tentative to unify the way dataTypes are reported in error messages. There, we decided to use always {{dataType.simpleString}}. Unfortunately, we missed many places where this still needed to be fixed. So we should update all the missing places in order to provide error messages coherently throughout the project. > DataType in error messages are not coherent > --- > > Key: SPARK-24268 > URL: https://issues.apache.org/jira/browse/SPARK-24268 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > > In SPARK-22893 there was a tentative to unify the way dataTypes are reported > in error messages. There, we decided to use always {{dataType.simpleString}}. > Unfortunately, we missed many places where this still needed to be fixed. > Moreover, it turns out that the right method to use is not {{simpleString}}, > but we should use {{catalogString}} instead. > So we should update all the missing places in order to provide error messages > coherently throughout the project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24745) Map function does not keep rdd name
[ https://issues.apache.org/jira/browse/SPARK-24745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538260#comment-16538260 ] Marco Gaido commented on SPARK-24745: - An RDD already has a unique ID. I think the name is just useful for the UI/debugging, but if you want to use it in your application you can still set the name on the RDD you create by mapping the original one, or you can create your own RDD implementation which retrieves the name from an ancestor. > Map function does not keep rdd name > > > Key: SPARK-24745 > URL: https://issues.apache.org/jira/browse/SPARK-24745 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Igor Pergenitsa >Priority: Minor > > This snippet > {code:scala} > val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd") > println(namedRdd.name) > val mappedRdd = namedRdd.map(_.length) > println(mappedRdd.name){code} > outputs: > named_rdd > null -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24745) Map function does not keep rdd name
[ https://issues.apache.org/jira/browse/SPARK-24745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537199#comment-16537199 ] Marco Gaido commented on SPARK-24745: - This makes sense, as the map operation creates a new RDD. So the new RDD has no name. > Map function does not keep rdd name > > > Key: SPARK-24745 > URL: https://issues.apache.org/jira/browse/SPARK-24745 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Igor Pergenitsa >Priority: Minor > > This snippet > {code:scala} > val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd") > println(namedRdd.name) > val mappedRdd = namedRdd.map(_.length) > println(mappedRdd.name){code} > outputs: > named_rdd > null -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
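The point in the comment above — `map` builds a brand-new RDD, so the name metadata of the old one is not carried over — can be mimicked with a toy model. This is illustrative only, not Spark's actual implementation:

```python
# Toy model: the name lives on the object it was set on; map() returns
# a fresh object whose name defaults to None.

class ToyRDD:
    def __init__(self, data, name=None):
        self.data = list(data)
        self.name = name  # metadata attached to this object only

    def setName(self, name):
        self.name = name
        return self

    def map(self, f):
        # a fresh object is created; nothing copies the name over
        return ToyRDD(map(f, self.data))

named = ToyRDD(["abc", "123"]).setName("named_rdd")
mapped = named.map(len)
```

This reproduces the behavior in the reported snippet: the original keeps its name, while the mapped result has none.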
[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels
[ https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536747#comment-16536747 ] Marco Gaido commented on SPARK-24719: - [~mengxr] any luck with this? Thanks. > ClusteringEvaluator supports integer type labels > > > Key: SPARK-24719 > URL: https://issues.apache.org/jira/browse/SPARK-24719 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.1 >Reporter: Xiangrui Meng >Priority: Major > > ClusterEvaluator should support integer labels because we output integer > labels in BisectingKMeans. > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77]. > We should cast numeric types to double in ClusteringEvaluator. > [~mgaido] Do you have time to work on the fix? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536682#comment-16536682 ] Marco Gaido commented on SPARK-24438: - IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value in partitions. > Empty strings and null strings are written to the same partition > > > Key: SPARK-24438 > URL: https://issues.apache.org/jira/browse/SPARK-24438 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Mukul Murthy >Priority: Major > > When you partition on a string column that has empty strings and nulls, they > are both written to the same default partition. When you read the data back, > all those values get read back as null. > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.catalyst.encoders.RowEncoder > val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, > null)) > val schema = new StructType().add("a", IntegerType).add("b", StringType) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > display(df) > => > a b > 1 > 2 > 3 > 4 hello > 5 null > df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4") > val df2 = spark.read.load("/home/mukul/weird_test_data4") > display(df2) > => > a b > 4 hello > 3 null > 2 null > 1 null > 5 null > {code} > Seems to affect multiple types of tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24438) Empty strings and null strings are written to the same partition
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536682#comment-16536682 ] Marco Gaido edited comment on SPARK-24438 at 7/9/18 8:37 AM: - IIRC, Hive has a placeholder string (\_\_HIVE_DEFAULT_PARTITION\_\_) for null value in partitions. was (Author: mgaido): IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value in partitions. > Empty strings and null strings are written to the same partition > > > Key: SPARK-24438 > URL: https://issues.apache.org/jira/browse/SPARK-24438 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Mukul Murthy >Priority: Major > > When you partition on a string column that has empty strings and nulls, they > are both written to the same default partition. When you read the data back, > all those values get read back as null. > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.catalyst.encoders.RowEncoder > val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, > null)) > val schema = new StructType().add("a", IntegerType).add("b", StringType) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > display(df) > => > a b > 1 > 2 > 3 > 4 hello > 5 null > df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4") > val df2 = spark.read.load("/home/mukul/weird_test_data4") > display(df2) > => > a b > 4 hello > 3 null > 2 null > 1 null > 5 null > {code} > Seems to affect multiple types of tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
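The placeholder scheme mentioned in the comment above can be sketched as follows. The helper functions are hypothetical, but the reserved directory name is the one Hive actually uses for null partition values; with such a scheme, an empty string and a null round-trip as distinct values:

```python
# Sketch of encoding partition values into directory names so that
# null and "" stay distinguishable. Illustrative helpers, not a real API.

NULL_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(column, value):
    # null maps to the reserved placeholder; "" stays a distinct (empty) value
    return f"{column}={NULL_PARTITION if value is None else value}"

def read_partition_value(dirname):
    _, raw = dirname.split("=", 1)
    return None if raw == NULL_PARTITION else raw
```

(The bug report shows that, in the affected versions, both values end up in the same default partition and are read back as null, which is exactly what a scheme like this avoids.)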
[jira] [Commented] (YARN-8385) Clean local directories when a container is killed
[ https://issues.apache.org/jira/browse/YARN-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536677#comment-16536677 ] Marco Gaido commented on YARN-8385: --- Thanks for your answer [~jlowe]. As stated in the question on SO (https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache), I think the application directory is used. From your comment above, I see why the data is not removed by YARN, though. So I think we have to investigate why Spark is using the application directory in this case. Thanks. > Clean local directories when a container is killed > -- > > Key: YARN-8385 > URL: https://issues.apache.org/jira/browse/YARN-8385 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Marco Gaido >Priority: Major > > In long running applications, it may happen that many containers are created > and killed. A use case is Spark Thrift Server when dynamic allocation is > enabled. A lot of containers are killed and the application keeps running > indefinitely. > Currently, YARN seems to remove the local directories only when the whole > application terminates. In the scenario described above, this can cause > serious resource leakages. Please, check > https://issues.apache.org/jira/browse/SPARK-22575. > I think YARN should clean up all the local directories of a container when it > is killed and not when the whole application terminates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (KNOX-1362) Add documentation for the interaction with Spark History Server (SHS)
[ https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531989#comment-16531989 ] Marco Gaido commented on KNOX-1362: --- Thanks for your work [~smore]. Sure, no worries. Thank you. > Add documentation for the interaction with Spark History Server (SHS) > - > > Key: KNOX-1362 > URL: https://issues.apache.org/jira/browse/KNOX-1362 > Project: Apache Knox > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 1.1.0 > > Attachments: KNOX-1362.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels
[ https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530493#comment-16530493 ] Marco Gaido commented on SPARK-24719: - [~mengxr] I tried to pass integer values in the prediction column and I was not able to reproduce any issue (I tried both distance measures). I also checked the code, and the prediction column is cast to double where needed. Can you provide a repro if you faced any issue? If that is not the case, is this JIRA meant for a small refactor which makes the casting clearer? Thanks. > ClusteringEvaluator supports integer type labels > > > Key: SPARK-24719 > URL: https://issues.apache.org/jira/browse/SPARK-24719 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.1 >Reporter: Xiangrui Meng >Priority: Major > > ClusterEvaluator should support integer labels because we output integer > labels in BisectingKMeans. > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77]. > We should cast numeric types to double in ClusteringEvaluator. > [~mgaido] Do you have time to work on the fix? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels
[ https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530264#comment-16530264 ] Marco Gaido commented on SPARK-24719: - Sure, thanks. I'll submit a PR ASAP. > ClusteringEvaluator supports integer type labels > > > Key: SPARK-24719 > URL: https://issues.apache.org/jira/browse/SPARK-24719 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.1 >Reporter: Xiangrui Meng >Priority: Major > > ClusterEvaluator should support integer labels because we output integer > labels in BisectingKMeans. > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77]. > We should cast numeric types to double in ClusteringEvaluator. > [~mgaido] Do you have time to work on the fix? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"
[ https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-24712. - Resolution: Not A Problem > TrainValidationSplit ignores label column name and forces to be "label" > --- > > Key: SPARK-24712 > URL: https://issues.apache.org/jira/browse/SPARK-24712 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Pablo J. Villacorta >Priority: Major > > When a TrainValidationSplit is fit on a Pipeline containing a ML model, the > labelCol property of the model is ignored, and the call to fit() will fail > unless the labelCol equals "label". As an example, the following pyspark code > only works when the variable labelColumn is set to "label" > {code:java} > from pyspark.sql.functions import rand, randn > from pyspark.ml.regression import LinearRegression > labelColumn = "target" # CHANGE THIS TO "label" AND THE CODE WORKS > df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), > randn(seed=27).alias(labelColumn)) > vectorAssembler = > VectorAssembler().setInputCols(["uniform"]).setOutputCol("features") > lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn) > mypipeline = Pipeline(stages = [vectorAssembler, lr]) > paramGrid = ParamGridBuilder()\ > .addGrid(lr.regParam, [0.01, 0.1])\ > .build() > trainValidationSplit = TrainValidationSplit()\ > .setEstimator(mypipeline)\ > .setEvaluator(RegressionEvaluator())\ > .setEstimatorParamMaps(paramGrid)\ > .setTrainRatio(0.8) > trainValidationSplit.fit(df) # FAIL UNLESS labelColumn IS SET TO "label" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"
[ https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529746#comment-16529746 ] Marco Gaido commented on SPARK-24712: - The problem is that you have not set the label on the evaluator you are passing to {{TrainValidationSplit}}. Please set it there and it will work. I am closing this, feel free to reopen if you face a problem. > TrainValidationSplit ignores label column name and forces to be "label" > --- > > Key: SPARK-24712 > URL: https://issues.apache.org/jira/browse/SPARK-24712 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Pablo J. Villacorta >Priority: Major > > When a TrainValidationSplit is fit on a Pipeline containing a ML model, the > labelCol property of the model is ignored, and the call to fit() will fail > unless the labelCol equals "label". As an example, the following pyspark code > only works when the variable labelColumn is set to "label" > {code:java} > from pyspark.sql.functions import rand, randn > from pyspark.ml.regression import LinearRegression > labelColumn = "target" # CHANGE THIS TO "label" AND THE CODE WORKS > df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), > randn(seed=27).alias(labelColumn)) > vectorAssembler = > VectorAssembler().setInputCols(["uniform"]).setOutputCol("features") > lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn) > mypipeline = Pipeline(stages = [vectorAssembler, lr]) > paramGrid = ParamGridBuilder()\ > .addGrid(lr.regParam, [0.01, 0.1])\ > .build() > trainValidationSplit = TrainValidationSplit()\ > .setEstimator(mypipeline)\ > .setEvaluator(RegressionEvaluator())\ > .setEstimatorParamMaps(paramGrid)\ > .setTrainRatio(0.8) > trainValidationSplit.fit(df) # FAIL UNLESS labelColumn IS SET TO "label" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
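Concretely, the fix described in the comment above is to set the label column on the evaluator passed to {{TrainValidationSplit}} as well, not only on the {{LinearRegression}} stage. In the reporter's pyspark snippet that is a one-line change (fragment only; {{mypipeline}}, {{paramGrid}} and {{labelColumn}} are as defined in the report, and this sketch assumes the rest of the snippet is unchanged):

```python
# Fragment of the reporter's snippet: the evaluator also needs the
# custom label column, otherwise it looks for the default "label".
trainValidationSplit = TrainValidationSplit()\
    .setEstimator(mypipeline)\
    .setEvaluator(RegressionEvaluator(labelCol=labelColumn))\
    .setEstimatorParamMaps(paramGrid)\
    .setTrainRatio(0.8)
```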
[jira] [Commented] (SPARK-24208) Cannot resolve column in self join after applying Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-24208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525210#comment-16525210 ] Marco Gaido commented on SPARK-24208: - I think this may be a duplicate of SPARK-24373. Can you try 2.3.1? > Cannot resolve column in self join after applying Pandas UDF > > > Key: SPARK-24208 > URL: https://issues.apache.org/jira/browse/SPARK-24208 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: AWS EMR 5.13.0 > Amazon Hadoop distribution 2.8.3 > Spark 2.3.0 > Pandas 0.22.0 >Reporter: Rafal Ganczarek >Priority: Minor > > I noticed that after applying Pandas UDF function, a self join of resulted > DataFrame will fail to resolve columns. The workaround that I found is to > recreate DataFrame with its RDD and schema. > Below you can find a Python code that reproduces the issue. > {code:java} > from pyspark import Row > import pyspark.sql.functions as F > @F.pandas_udf('key long, col string', F.PandasUDFType.GROUPED_MAP) > def dummy_pandas_udf(df): > return df[['key','col']] > df = spark.createDataFrame([Row(key=1,col='A'), Row(key=1,col='B'), > Row(key=2,col='C')]) > # transformation that causes the issue > df = df.groupBy('key').apply(dummy_pandas_udf) > # WORKAROUND that fixes the issue > # df = spark.createDataFrame(df.rdd, df.schema) > df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == > F.col('temp1.key')).show() > {code} > If workaround line is commented out, then above code fails with the following > error: > {code:java} > AnalysisExceptionTraceback (most recent call last) > in () > 12 # df = spark.createDataFrame(df.rdd, df.schema) > 13 > ---> 14 df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == > F.col('temp1.key')).show() > /usr/lib/spark/python/pyspark/sql/dataframe.py in join(self, other, on, how) > 929 on = self._jseq([]) > 930 assert isinstance(how, basestring), "how should be > basestring" > --> 931 jdf = 
self._jdf.join(other._jdf, on, how) > 932 return DataFrame(jdf, self.sql_ctx) > 933 > /usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, > *args) >1158 answer = self.gateway_client.send_command(command) >1159 return_value = get_return_value( > -> 1160 answer, self.gateway_client, self.target_id, self.name) >1161 >1162 for temp_arg in temp_args: > /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 67 > e.java_exception.getStackTrace())) > 68 if s.startswith('org.apache.spark.sql.AnalysisException: > '): > ---> 69 raise AnalysisException(s.split(': ', 1)[1], > stackTrace) > 70 if s.startswith('org.apache.spark.sql.catalyst.analysis'): > 71 raise AnalysisException(s.split(': ', 1)[1], > stackTrace) > AnalysisException: u"cannot resolve '`temp0.key`' given input columns: > [temp0.key, temp0.col];;\n'Join Inner, ('temp0.key = 'temp1.key)\n:- > AnalysisBarrier\n: +- SubqueryAlias temp0\n:+- > FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), > [key#4104L, col#4105]\n: +- Project [key#4099L, col#4098, > key#4099L]\n: +- LogicalRDD [col#4098, key#4099L], false\n+- > AnalysisBarrier\n +- SubqueryAlias temp1\n +- > FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), > [key#4104L, col#4105]\n+- Project [key#4099L, col#4098, > key#4099L]\n +- LogicalRDD [col#4098, key#4099L], false\n" > {code} > The same happens, if instead of DataFrame API I use Spark SQL to do a self > join: > {code:java} > # df is a DataFrame after applying dummy_pandas_udf > df.createOrReplaceTempView('df') > spark.sql(''' > SELECT > * > FROM df temp0 > LEFT JOIN df temp1 ON > temp0.key == temp1.key > ''').show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24660) SHS is not showing properly errors when downloading logs
Marco Gaido created SPARK-24660: --- Summary: SHS is not showing properly errors when downloading logs Key: SPARK-24660 URL: https://issues.apache.org/jira/browse/SPARK-24660 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: Marco Gaido The History Server does not properly show errors that occur when trying to download logs. In particular, when downloading logs the user is not authorized to see, the user gets a "File not found" error instead of an unauthorized response. Similarly, trying to download logs for a non-existent application returns a server error instead of a 404 response. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (KNOX-1362) Add documentation for the interaction with Spark History Server (SHS)
[ https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated KNOX-1362: -- Attachment: KNOX-1362.patch > Add documentation for the interaction with Spark History Server (SHS) > - > > Key: KNOX-1362 > URL: https://issues.apache.org/jira/browse/KNOX-1362 > Project: Apache Knox > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Marco Gaido >Priority: Major > Attachments: KNOX-1362.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KNOX-1362) Add documentation for the interaction with Spark History Server (SHS)
[ https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520314#comment-16520314 ] Marco Gaido commented on KNOX-1362: --- Thank you [~smore]! > Add documentation for the interaction with Spark History Server (SHS) > - > > Key: KNOX-1362 > URL: https://issues.apache.org/jira/browse/KNOX-1362 > Project: Apache Knox > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Marco Gaido >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520125#comment-16520125 ] Marco Gaido commented on SPARK-24498: - Thanks for your great analysis [~maropu]! Very interesting. It seems there is no advantage in introducing a new compiler. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation than Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen; Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (KNOX-1362) Add documentation for the interaction with SHS
[ https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519036#comment-16519036 ] Marco Gaido commented on KNOX-1362: --- [~lmccay] I created the issue as you suggested in KNOX-1354. Unfortunately, though, I cannot find where the documentation lives in order to provide a patch for it. Could you help me with this? Thanks. > Add documentation for the interaction with SHS > -- > > Key: KNOX-1362 > URL: https://issues.apache.org/jira/browse/KNOX-1362 > Project: Apache Knox > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Marco Gaido >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KNOX-1362) Add documentation for the interaction with SHS
Marco Gaido created KNOX-1362: - Summary: Add documentation for the interaction with SHS Key: KNOX-1362 URL: https://issues.apache.org/jira/browse/KNOX-1362 Project: Apache Knox Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Marco Gaido -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KNOX-1315) Spark UI urls issue: Jobs, stdout/stderr and threadDump links
[ https://issues.apache.org/jira/browse/KNOX-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519032#comment-16519032 ] Marco Gaido commented on KNOX-1315: --- [~lmccay] this is actually a patch for the YARN UI. As I am not an expert in that area, could someone else take a look at it? Thanks. > Spark UI urls issue: Jobs, stdout/stderr and threadDump links > - > > Key: KNOX-1315 > URL: https://issues.apache.org/jira/browse/KNOX-1315 > Project: Apache Knox > Issue Type: Bug >Affects Versions: 0.14.0, 1.0.0 >Reporter: Guang Yang >Assignee: Guang Yang >Priority: Major > Fix For: 1.1.0 > > Attachments: KNOX-1315.patch > > > When users get to the Spark UI page by clicking the *{{Application > Master}}* link on the YARN application page for running applications, the link > for an *individual job* doesn't work. Also, if users go to the *executors* page, the > stdout/stderr and threadDump links don't work either. > The above issues occur on this page: > https://host:port/gateway/sandbox/yarn/proxy/application_1525479109400_910288 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518196#comment-16518196 ] Marco Gaido commented on SPARK-24607: - [~viirya] please check the description in the Hive ticket. This happens when there are task failures. I have not tried to reproduce it and check whether Spark is affected too, but it may be. > Distribute by rand() can lead to data inconsistency > --- > > Key: SPARK-24607 > URL: https://issues.apache.org/jira/browse/SPARK-24607 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: zenglinxi >Priority: Major > > Noticed that the following queries can give different results: > {code:java} > select count(*) from tbl; > select count(*) from (select * from tbl distribute by rand()) a;{code} > This issue was first reported by someone using Kylin to build cubes with > HiveQL that includes distribute by rand(); I think it is also a hidden > serious problem in Spark SQL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
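The failure mode described above can be illustrated with a minimal pure-Python sketch (hypothetical, not Spark code): if a failed task is re-executed, `rand()` is re-evaluated and rows can land in different partitions than in the first attempt, whereas a hash of a real column is stable across retries.

```python
import random

rows = list(range(1000))
num_partitions = 8

def assign_by_rand(rows, n):
    # like "distribute by rand()": a fresh random draw per evaluation,
    # so a re-executed task may place rows differently than the first attempt
    return [random.randrange(n) for _ in rows]

def assign_by_hash(rows, n):
    # deterministic alternative: the same row always maps to the same partition
    return [hash(r) % n for r in rows]

# simulate a first attempt and a retry of the same task
first_attempt = assign_by_rand(rows, num_partitions)
retry = assign_by_rand(rows, num_partitions)

stable_first = assign_by_hash(rows, num_partitions)
stable_retry = assign_by_hash(rows, num_partitions)
```

With 1000 rows the two random assignments virtually never coincide, which is exactly why a retried stage can end up duplicating or dropping rows downstream.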
[jira] [Updated] (SPARK-24606) Decimals multiplication and division may be null due to the result precision overflow
[ https://issues.apache.org/jira/browse/SPARK-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24606: Priority: Major (was: Blocker) > Decimals multiplication and division may be null due to the result precision > overflow > - > > Key: SPARK-24606 > URL: https://issues.apache.org/jira/browse/SPARK-24606 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Yan Jian >Priority: Major > > Spark performs mul / div on Decimals via Java's BigDecimal, whose scale may > be greater than its precision, with 38 precision limit. > If the result BigDecimal's precision is 38, and its scale is greater than 38 > ( 39 e.g. ), the converted decimal (in spark SQL) is in precision of 40 ( = > 39 + 1, and > 38 ). > > Run following SQLs to reproduce this: > {code:sql} > select (cast (1.0 as decimal(38,37))) * 1.8; > select (cast (0.07654387654321 as decimal(38,37))) / > 99; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24606) Decimals multiplication and division may be null due to the result precision overflow
[ https://issues.apache.org/jira/browse/SPARK-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518130#comment-16518130 ] Marco Gaido commented on SPARK-24606: - Critical and Blocker are reserved for committers. Closing as this is a duplicate. Thanks. > Decimals multiplication and division may be null due to the result precision > overflow > - > > Key: SPARK-24606 > URL: https://issues.apache.org/jira/browse/SPARK-24606 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Yan Jian >Priority: Major > > Spark performs mul / div on Decimals via Java's BigDecimal, whose scale may > be greater than its precision, with 38 precision limit. > If the result BigDecimal's precision is 38, and its scale is greater than 38 > ( 39 e.g. ), the converted decimal (in spark SQL) is in precision of 40 ( = > 39 + 1, and > 38 ). > > Run following SQLs to reproduce this: > {code:sql} > select (cast (1.0 as decimal(38,37))) * 1.8; > select (cast (0.07654387654321 as decimal(38,37))) / > 99; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24606) Decimals multiplication and division may be null due to the result precision overflow
[ https://issues.apache.org/jira/browse/SPARK-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-24606. - Resolution: Duplicate > Decimals multiplication and division may be null due to the result precision > overflow > - > > Key: SPARK-24606 > URL: https://issues.apache.org/jira/browse/SPARK-24606 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Yan Jian >Priority: Blocker > > Spark performs mul / div on Decimals via Java's BigDecimal, whose scale may > be greater than its precision, with 38 precision limit. > If the result BigDecimal's precision is 38, and its scale is greater than 38 > ( 39 e.g. ), the converted decimal (in spark SQL) is in precision of 40 ( = > 39 + 1, and > 38 ). > > Run following SQLs to reproduce this: > {code:sql} > select (cast (1.0 as decimal(38,37))) * 1.8; > select (cast (0.07654387654321 as decimal(38,37))) / > 99; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
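The arithmetic behind the reported overflow can be sketched in plain Python with the `decimal` module standing in for Java's `BigDecimal` (a sketch, not Spark code): multiplying a precision-38/scale-37 value by a scale-1 value yields an exact result needing 39 digits at scale 38, beyond what any DecimalType with the 38-digit cap can hold.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # keep exact digits, mimicking Java's BigDecimal

a = Decimal("1.0000000000000000000000000000000000001")  # precision 38, scale 37
b = Decimal("1.8")                                      # scale 1

product = a * b
tup = product.as_tuple()
precision = len(tup.digits)   # total significant digits of the exact result
scale = -tup.exponent         # digits after the decimal point

# The exact product needs 39 digits at scale 38 -- more than Spark's
# 38-digit DecimalType cap, which is why the query in the report
# could not represent the result.
```

The scale of a `BigDecimal` product is the sum of the operand scales (37 + 1 = 38 here), which is how the scale can outgrow the precision limit even for small-looking values.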
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514687#comment-16514687 ] Marco Gaido commented on SPARK-23901: - These functions can be used like any other function in Hive; they are not just there for the Hive authorizer. I think the use case for them is to anonymize data for privacy reasons (e.g. exposing/exporting data to other parties without revealing sensitive values, while still being able to use them in joins). > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
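The anonymize-but-still-joinable idea can be sketched in plain Python (a stand-in, not Hive's actual implementation of mask_hash()): a deterministic one-way hash maps equal inputs to equal outputs, so masked columns still match in joins without exposing the raw values.

```python
import hashlib

def mask_hash(value: str) -> str:
    # hypothetical analogue of Hive's mask_hash(): a deterministic one-way
    # hash, so equal inputs still join while the raw value stays hidden
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# equal sensitive values still match after masking...
assert mask_hash("alice@example.com") == mask_hash("alice@example.com")
# ...while distinct values stay distinct, so join semantics are preserved
assert mask_hash("alice@example.com") != mask_hash("bob@example.com")
```

This is why a dataset exported with hashed keys can still be joined back against another hashed export, which is the use case mentioned in the comment above.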
[jira] [Updated] (KNOX-1358) Create new version definition for SHS
[ https://issues.apache.org/jira/browse/KNOX-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated KNOX-1358: -- Attachment: KNOX-1358.patch > Create new version definition for SHS > - > > Key: KNOX-1358 > URL: https://issues.apache.org/jira/browse/KNOX-1358 > Project: Apache Knox > Issue Type: New Feature >Reporter: Marco Gaido >Priority: Major > Attachments: KNOX-1358.patch > > > Now that SHS supports X-Forwarded-Context and has fixed some issues with its > UI when behind a proxy, we can provide a service definition for newer SHS > versions that exploits those features, instead of continuing to work on the > old service definition, which has been there since Spark 1.4.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KNOX-1358) Create new version definition for SHS
Marco Gaido created KNOX-1358: - Summary: Create new version definition for SHS Key: KNOX-1358 URL: https://issues.apache.org/jira/browse/KNOX-1358 Project: Apache Knox Issue Type: New Feature Reporter: Marco Gaido Now that SHS supports X-Forwarded-Context and has fixed some issues with its UI when behind a proxy, we can provide a service definition for newer SHS versions that exploits those features, instead of continuing to work on the old service definition, which has been there since Spark 1.4.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KNOX-1353) SHS always showing link to incomplete applications
[ https://issues.apache.org/jira/browse/KNOX-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513503#comment-16513503 ] Marco Gaido commented on KNOX-1353: --- Sorry [~lmccay], I'll be more careful next time. Thanks. > SHS always showing link to incomplete applications > -- > > Key: KNOX-1353 > URL: https://issues.apache.org/jira/browse/KNOX-1353 > Project: Apache Knox > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 1.1.0 > > Attachments: KNOX-1353.patch > > > SHS always shows the "Show incomplete applications" link, even when > it is already showing the incomplete applications. In that case it should show the > "Back to completed applications" link instead. > The reason for this behavior is that the URL is not rewritten correctly and > the {{?showIncomplete=true}} parameter in the URL gets lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
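The bug class can be illustrated with a toy rewrite in plain Python (hypothetical URLs; this is not Knox's actual rewrite engine): a rewrite that keeps only the path silently drops `?showIncomplete=true`, while one that carries the query string over preserves it.

```python
from urllib.parse import urlsplit

def rewrite_dropping_query(url, new_prefix):
    # buggy rewrite: keeps only the path, losing the query string
    parts = urlsplit(url)
    return new_prefix + parts.path

def rewrite_keeping_query(url, new_prefix):
    # correct rewrite: the query string is carried over to the proxied URL
    parts = urlsplit(url)
    rewritten = new_prefix + parts.path
    return rewritten + ("?" + parts.query if parts.query else "")

src = "https://shs-host:18080/?showIncomplete=true"
gateway = "https://gw/gateway/sandbox/sparkhistory"

broken = rewrite_dropping_query(src, gateway)   # parameter is gone
fixed = rewrite_keeping_query(src, gateway)     # parameter survives
```

With the parameter dropped, SHS always renders the default (completed-applications) view, which matches the symptom described in the issue.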
[jira] [Created] (SPARK-24562) Allow running same tests with multiple configs in SQLQueryTestSuite
Marco Gaido created SPARK-24562: --- Summary: Allow running same tests with multiple configs in SQLQueryTestSuite Key: SPARK-24562 URL: https://issues.apache.org/jira/browse/SPARK-24562 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Marco Gaido We often need to run the same queries with different configs in order to check their behavior under every configuration. In particular, we have 2 cases: - same queries with different configs should have the same result; - same queries with different configs should have different results. This ticket aims to introduce support for both cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
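The two cases can be sketched with a pure-Python harness (all names here are illustrative, not the actual SQLQueryTestSuite API): run each query once per config set, then assert either that all results agree or that they differ, depending on which case is being tested.

```python
def run_under_configs(query_fn, config_sets):
    # hypothetical harness: evaluate the same "query" once per config set
    return [query_fn(cfg) for cfg in config_sets]

# toy query whose result depends on a (made-up here) config flag
def toy_query(cfg):
    return "ansi" if cfg.get("spark.sql.ansi.enabled") else "legacy"

# case 1: configs that should NOT change the result
same = run_under_configs(toy_query, [{"unrelated.conf": 1},
                                     {"unrelated.conf": 2}])

# case 2: configs that SHOULD change the result
diff = run_under_configs(toy_query, [{"spark.sql.ansi.enabled": True},
                                     {"spark.sql.ansi.enabled": False}])
```

Collapsing each result list into a set makes both checks one-liners: a singleton set means the configs were behavior-preserving, a larger set means they changed the output.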