[jira] [Commented] (SPARK-25364) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-07 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607019#comment-16607019
 ] 

Marco Gaido commented on SPARK-25364:
-

It seems you created two JIRAs which are the same; if that is the case, can you 
close this one or the next one? Thanks.

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25364
> URL: https://issues.apache.org/jira/browse/SPARK-25364
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and users are advised 
> to use a large integer value as the numFeatures parameter.
> We found several issues with the current implementation:
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when reading feature importances from a decision tree 
> trained on FeatureHasher output.
>  # When indices collide, which is very likely when 'numFeatures' is relatively 
> small, the value at that index is updated with the sum of the old and new 
> values, i.e. the value of the conflicting feature vector entry is changed by 
> this module.
>  # To avoid collisions, 'numFeatures' has to be set to a large number, but the 
> resulting highly sparse vectors increase the computational cost of model 
> training.
> We are working on fixing these problems for our own business needs; since they 
> may or may not be an issue for others as well, we'd like to hear from the 
> community.
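
For illustration (this sketch is not part of the original report; the column 
names and the tiny numFeatures value are made up), the collision behaviour 
described above can be observed directly:

{code:scala}
// Minimal sketch: with a deliberately small numFeatures, different input
// columns can hash to the same vector index; FeatureHasher sums the colliding
// values, so the original values cannot be recovered and the index cannot be
// mapped back to a feature name.
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hasher-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 2.0, "a"), (3.0, 4.0, "b")).toDF("real1", "real2", "cat")

val hasher = new FeatureHasher()
  .setInputCols("real1", "real2", "cat")
  .setOutputCol("features")
  .setNumFeatures(8) // deliberately tiny so that index collisions are very likely

hasher.transform(df).select("features").show(truncate = false)
{code}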



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-05 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604133#comment-16604133
 ] 

Marco Gaido commented on SPARK-25317:
-

[~kiszk] sure, we can investigate further in the PR the root cause. Thanks.

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`.
> On my laptop, the result for master vs 2.3 is: 120 us vs 40 us.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-04 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603174#comment-16603174
 ] 

Marco Gaido commented on SPARK-25317:
-

I think I have a fix for this. I can submit a PR if you want, but I am still 
not sure about the root cause of the regression. My best guess is that there 
is more than one cause and the perf improvement only happens if all of them 
are fixed, which is rather strange to me.

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`.
> On my laptop, the result for master vs 2.3 is: 120 us vs 40 us.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (LIVY-506) Dedicated thread for timeout checker

2018-09-03 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-506:


 Summary: Dedicated thread for timeout checker
 Key: LIVY-506
 URL: https://issues.apache.org/jira/browse/LIVY-506
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


The timeout checker task currently runs on the background pool of threads. Since 
the task is always alive, this doesn't make a lot of sense, as it permanently 
occupies a thread from the pool. It should instead use its own dedicated thread.
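
A minimal sketch of the direction (names and intervals are illustrative, not 
Livy's actual code): run the periodic check on its own named daemon thread 
instead of keeping a worker from the shared background pool busy forever.

{code:scala}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Dedicated single-threaded scheduler for the timeout checker.
val timeoutCheckerExecutor = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "session-timeout-checker")
    t.setDaemon(true)
    t
  }
})

// Placeholder for the existing logic that expires idle sessions.
def checkTimedOutSessions(): Unit = {
  // iterate over the active sessions and stop the ones past their timeout
}

timeoutCheckerExecutor.scheduleWithFixedDelay(new Runnable {
  override def run(): Unit = checkTimedOutSessions()
}, 60, 60, TimeUnit.SECONDS)
{code}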



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25265) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596221#comment-16596221
 ] 

Marco Gaido commented on SPARK-25265:
-

Isn't this a duplicate of the next one?

> Fix memory leak vulnerability in Barrier Execution Mode
> ---
>
> Key: SPARK-25265
> URL: https://issues.apache.org/jira/browse/SPARK-25265
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  
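
A standalone sketch of the leak pattern (plain {{java.util.Timer}}, not Spark's 
code): a cancelled task stays referenced in the timer's internal queue until 
{{purge()}} runs or the task's scheduled time passes.

{code:scala}
import java.util.{Timer, TimerTask}

val timer = new Timer("barrier-timer-sketch", true) // daemon timer thread

val task = new TimerTask {
  override def run(): Unit = println("barrier timed out")
}
timer.schedule(task, 3600L * 1000L) // scheduled far in the future

task.cancel()                 // marks the task cancelled, but it is still queued
val removed = timer.purge()   // drops cancelled tasks, releasing their references
println(s"purged $removed cancelled task(s)")
{code}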



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2018-08-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596109#comment-16596109
 ] 

Marco Gaido commented on SPARK-25219:
-

Well, there are many differences between the Spark ML and SKLearn code you've 
posted. First of all, the number of clusters is different. Moreover, the input 
data to KMeans can be different.

Please store the data after the TF-IDF transformation, which is the interesting 
one. Then take the KMeans results and the centroids, and check whether the 
distance of each point to the centroid it has been assigned to is lower than 
its distance to all the other centroids. If that is the case, there is no issue 
with KMeans: you may have to increase the number of runs, change the 
initialization method, change the seed and so on to get a different result, but 
there is no evident bug in the algorithm itself. If that is not the case, then 
with the input data to KMeans and a reproducer I can investigate the problem. 
Thanks.
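
A sketch of the suggested check (it assumes a fitted 
{{org.apache.spark.ml.clustering.KMeansModel}} and a DataFrame with a 
{{features}} vector column after TF-IDF; the column names are the ML defaults, 
not taken from the report):

{code:scala}
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame

// Returns true if every point is at least as close to its assigned centroid
// as to any other centroid (up to a small numerical tolerance).
def assignmentsLookConsistent(model: KMeansModel, tfidf: DataFrame): Boolean = {
  val centers = model.clusterCenters
  model.transform(tfidf)
    .select("features", "prediction")
    .collect()            // fine for a sample; aggregate distributedly for big data
    .forall { row =>
      val point    = row.getAs[Vector](0)
      val assigned = row.getInt(1)
      val dists    = centers.map(c => Vectors.sqdist(point, c))
      dists(assigned) <= dists.min + 1e-9
    }
}
{code}

If this check passes, the clustering is a valid local optimum and the imbalance 
comes from the data or the initialization rather than from a bug in the 
algorithm.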

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
> Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, 
> Spark_Kmeans.txt
>
>
> Hello everyone,
> I am facing issues with KMeans clustering on my text data. When I apply 
> clustering to the text data, after performing various transformations such as 
> RegexTokenizer, stopword removal, HashingTF and IDF, the generated clusters 
> are not proper and one cluster ends up with a lot of data points assigned to 
> it.
> I am able to perform clustering with a similar kind of processing and the 
> same attributes using the SKLearn KMeans algorithm.
> Searching on the internet, I see that many others have reported the same 
> issue with Spark's KMeans clustering library.
> I would appreciate your help in fixing this issue.
> Please let me know if you require any additional details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23622) Flaky Test: HiveClientSuites

2018-08-28 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595068#comment-16595068
 ] 

Marco Gaido commented on SPARK-23622:
-

This failure became permanent in the last build (at least it seems so).

> Flaky Test: HiveClientSuites
> 
>
> Key: SPARK-23622
> URL: https://issues.apache.org/jira/browse/SPARK-23622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test 
> (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325
> {code}
> Error Message
> java.lang.reflect.InvocationTargetException: null
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
>   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
>   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
>   at org.scalatest.Suite$class.run(Suite.scala:1144)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117)
>   ... 29 more
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425)
>   ... 31 more
> Caused by: sbt.ForkMain$ForkError: 
> 

[jira] [Created] (LIVY-503) Move RPC classes used in thriftserver to a separate module

2018-08-28 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-503:


 Summary: Move RPC classes used in thriftserver to a separate module
 Key: LIVY-503
 URL: https://issues.apache.org/jira/browse/LIVY-503
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


As suggested in the discussion for the original PR 
(https://github.com/apache/incubator-livy/pull/104#discussion_r212806490), we 
should move the RPC classes which need to be uploaded to the Spark session into 
a separate module, in order to upload as few classes as possible and avoid any 
potential interference with the Spark session that is created.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-502) Cleanup Hive dependencies

2018-08-28 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-502:


 Summary: Cleanup Hive dependencies
 Key: LIVY-502
 URL: https://issues.apache.org/jira/browse/LIVY-502
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


In the initial implementation we rely on, and delegate some of the work to, the 
Hive classes used in HiveServer2. This helped simplify the first implementation, 
as it saved us from writing a lot of code. But it also introduced a dependency 
on the {{hive-exec}} package and compelled us to modify some of the existing 
Hive classes slightly.

This JIRA tracks removing these workarounds by re-implementing the same logic in 
Livy, in order to get rid of all Hive dependencies other than the RPC and 
service layers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails

2018-08-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592609#comment-16592609
 ] 

Marco Gaido commented on SPARK-25193:
-

Well, I think this is HIVE-12505. So it would need to be fixed in the Hive 
version which is shipped with Spark...

> insert overwrite doesn't throw exception when drop old data fails
> -
>
> Key: SPARK-25193
> URL: https://issues.apache.org/jira/browse/SPARK-25193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: chen xiao
>Priority: Major
>
> dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName")
> Insert overwrite mode drops the old data in the Hive table if there is any.
> But if deleting the old data fails, no exception is thrown and the data folder 
> ends up like:
> hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0
> hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513.
> Two copies of the data are kept.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2018-08-24 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591423#comment-16591423
 ] 

Marco Gaido commented on SPARK-25219:
-

Hi [~VVasanth], a JIRA like this is very difficult to work on: saying that 
something returns a result which is not the expected one is not a great 
starting point for taking action.

It would be great if you could provide a simple reproducer. If possible, the 
reproducer should involve only one thing (in this case KMeans, without the 
other transformations), with a set of parameters that reproduce the problem and 
the expected result that the other libraries return with the same parameters.

Once the problem is clearer, I am happy to work on it, but first we need to 
understand whether this is indeed an issue and how to reproduce it. Thanks.

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
>
> Hello everyone,
> I am facing issues with KMeans clustering on my text data. When I apply 
> clustering to the text data, after performing various transformations such as 
> RegexTokenizer, stopword removal, HashingTF and IDF, the generated clusters 
> are not proper and one cluster ends up with a lot of data points assigned to 
> it.
> I am able to perform clustering with a similar kind of processing and the 
> same attributes using the SKLearn KMeans algorithm.
> Searching on the internet, I see that many others have reported the same 
> issue with Spark's KMeans clustering library.
> I would appreciate your help in fixing this issue.
> Please let me know if you require any additional details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2018-08-24 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25219:

Component/s: (was: Spark Submit)
 ML

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
>
> Hello everyone,
> I am facing issues with KMeans clustering on my text data. When I apply 
> clustering to the text data, after performing various transformations such as 
> RegexTokenizer, stopword removal, HashingTF and IDF, the generated clusters 
> are not proper and one cluster ends up with a lot of data points assigned to 
> it.
> I am able to perform clustering with a similar kind of processing and the 
> same attributes using the SKLearn KMeans algorithm.
> Searching on the internet, I see that many others have reported the same 
> issue with Spark's KMeans clustering library.
> I would appreciate your help in fixing this issue.
> Please let me know if you require any additional details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25146) avg() returns null on some decimals

2018-08-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584016#comment-16584016
 ] 

Marco Gaido commented on SPARK-25146:
-

No problem, thanks for reporting this anyway.

> avg() returns null on some decimals
> ---
>
> Key: SPARK-25146
> URL: https://issues.apache.org/jira/browse/SPARK-25146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Daniel Darabos
>Priority: Major
>
> We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average 
> them. The average in some cases comes out to {{null}} to our surprise (and 
> disappointment).
> After a bit of digging it looks like these numbers have ended up with the 
> {{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with 
> this type:
> {code}
> scala> (1 to 1).map(_*0.001).toDF.createOrReplaceTempView("x")
> scala> spark.sql("select cast(value as decimal(37, 30)) as v from 
> x").createOrReplaceTempView("x")
> scala> spark.sql("select avg(v) from x").show
> +--+
> |avg(v)|
> +--+
> |  null|
> +--+
> {code}
> For up to 4471 numbers it is able to calculate the average. For 4472 or more 
> numbers it's {{null}}.
> Now I'll just change these numbers to {{double}}. But we got the types 
> entirely automatically. We never asked for {{decimal}}. If this is the 
> default type, it's important to support averaging a handful of them. (Sorry 
> for the bitterness. I like {{double}} more. :))
> Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise 
> that {{avg()}} fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25146) avg() returns null on some decimals

2018-08-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583936#comment-16583936
 ] 

Marco Gaido commented on SPARK-25146:
-

This has been fixed by SPARK-24957; it doesn't repro on the current master. I can 
only advise upgrading to 2.3.2 or 2.4.0 once they are available (probably not 
too far away).

I am closing this as a duplicate. Please reopen if anything else is needed. 
Thanks.

> avg() returns null on some decimals
> ---
>
> Key: SPARK-25146
> URL: https://issues.apache.org/jira/browse/SPARK-25146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Daniel Darabos
>Priority: Major
>
> We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average 
> them. The average in some cases comes out to {{null}} to our surprise (and 
> disappointment).
> After a bit of digging it looks like these numbers have ended up with the 
> {{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with 
> this type:
> {code}
> scala> (1 to 1).map(_*0.001).toDF.createOrReplaceTempView("x")
> scala> spark.sql("select cast(value as decimal(37, 30)) as v from 
> x").createOrReplaceTempView("x")
> scala> spark.sql("select avg(v) from x").show
> +--+
> |avg(v)|
> +--+
> |  null|
> +--+
> {code}
> For up to 4471 numbers it is able to calculate the average. For 4472 or more 
> numbers it's {{null}}.
> Now I'll just change these numbers to {{double}}. But we got the types 
> entirely automatically. We never asked for {{decimal}}. If this is the 
> default type, it's important to support averaging a handful of them. (Sorry 
> for the bitterness. I like {{double}} more. :))
> Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise 
> that {{avg()}} fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25146) avg() returns null on some decimals

2018-08-17 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-25146.
-
Resolution: Duplicate

> avg() returns null on some decimals
> ---
>
> Key: SPARK-25146
> URL: https://issues.apache.org/jira/browse/SPARK-25146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Daniel Darabos
>Priority: Major
>
> We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average 
> them. The average in some cases comes out to {{null}} to our surprise (and 
> disappointment).
> After a bit of digging it looks like these numbers have ended up with the 
> {{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with 
> this type:
> {code}
> scala> (1 to 1).map(_*0.001).toDF.createOrReplaceTempView("x")
> scala> spark.sql("select cast(value as decimal(37, 30)) as v from 
> x").createOrReplaceTempView("x")
> scala> spark.sql("select avg(v) from x").show
> +--+
> |avg(v)|
> +--+
> |  null|
> +--+
> {code}
> For up to 4471 numbers it is able to calculate the average. For 4472 or more 
> numbers it's {{null}}.
> Now I'll just change these numbers to {{double}}. But we got the types 
> entirely automatically. We never asked for {{decimal}}. If this is the 
> default type, it's important to support averaging a handful of them. (Sorry 
> for the bitterness. I like {{double}} more. :))
> Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise 
> that {{avg()}} fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-08-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583928#comment-16583928
 ] 

Marco Gaido commented on SPARK-25145:
-

cc [~dongjoon]

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: %run -i create_bug.py
> Welcome to
>  __
> / __/__ ___ _/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT
> /_/
> Using Python version 3.6.3 (default, May 4 2018 04:22:28)
> SparkSession available as 'spark'.
> Created spark dataframe:
> +---+---+
> | a| b|
> +---+---+
> | 0|0.0|
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 4|2.0|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> +---+---+
> Read entire table with "filterPushdown"=True
> +---+---+
> | a| b|
> +---+---+
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> | 4|2.0|
> | 

[jira] [Commented] (SPARK-25138) Spark Shell should show the Scala prompt after initialization is complete

2018-08-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583585#comment-16583585
 ] 

Marco Gaido commented on SPARK-25138:
-

[~smilegator] this is caused by SPARK-24418 and it is a duplicate of 
SPARK-24785, for which there is a PR. cc [~dbtsai]

I am closing this as a duplicate. Please reopen if needed. Thanks.

> Spark Shell should show the Scala prompt after initialization is complete
> -
>
> Key: SPARK-25138
> URL: https://issues.apache.org/jira/browse/SPARK-25138
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> In previous Spark versions, the Spark Shell used to only show the Scala 
> prompt *after* Spark has initialized. i.e. when the user is able to enter 
> code, the Spark context, Spark session etc have all completed initialization, 
> so {{sc}}, {{spark}} are all ready to use.
> In the current Spark master branch (to become Spark 2.4.0), the Scala prompt 
> shows up immediately, while Spark itself is still initializing in the 
> background. It's very easy for the user to feel as if the shell is ready and 
> start typing, only to find that Spark isn't ready yet, and Spark's 
> initialization logs get in the way of typing. This new behavior is rather 
> annoying from a usability perspective.
> A typical startup of the Spark Shell in current master:
> {code:none}
> $ bin/spark-shell
> 18/08/16 23:18:05 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.range(1)Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1534486692744).
> Spark session available as 'spark'.
> .show
> +---+
> | id|
> +---+
> |  0|
> +---+
> scala> 
> {code}
> Could you see that it was running {{spark.range(1).show}} ?
> In contrast, previous versions of the Spark Shell would wait for Spark to fully 
> initialize:
> {code:none}
> $ bin/spark-shell
> 18/08/16 23:20:05 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://10.0.0.76:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1534486813159).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.range(1).show
> +---+
> | id|
> +---+
> |  0|
> +---+
> scala> 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25138) Spark Shell should show the Scala prompt after initialization is complete

2018-08-17 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-25138.
-
Resolution: Duplicate

> Spark Shell should show the Scala prompt after initialization is complete
> -
>
> Key: SPARK-25138
> URL: https://issues.apache.org/jira/browse/SPARK-25138
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> In previous Spark versions, the Spark Shell used to only show the Scala 
> prompt *after* Spark has initialized. i.e. when the user is able to enter 
> code, the Spark context, Spark session etc have all completed initialization, 
> so {{sc}}, {{spark}} are all ready to use.
> In the current Spark master branch (to become Spark 2.4.0), the Scala prompt 
> shows up immediately, while Spark itself is still initializing in the 
> background. It's very easy for the user to feel as if the shell is ready and 
> start typing, only to find that Spark isn't ready yet, and Spark's 
> initialization logs get in the way of typing. This new behavior is rather 
> annoying from a usability perspective.
> A typical startup of the Spark Shell in current master:
> {code:none}
> $ bin/spark-shell
> 18/08/16 23:18:05 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.range(1)Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1534486692744).
> Spark session available as 'spark'.
> .show
> +---+
> | id|
> +---+
> |  0|
> +---+
> scala> 
> {code}
> Could you see that it was running {{spark.range(1).show}} ?
> In contrast, previous versions of the Spark Shell would wait for Spark to fully 
> initialize:
> {code:none}
> $ bin/spark-shell
> 18/08/16 23:20:05 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://10.0.0.76:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1534486813159).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.range(1).show
> +---+
> | id|
> +---+
> |  0|
> +---+
> scala> 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582543#comment-16582543
 ] 

Marco Gaido commented on SPARK-25093:
-

[~igreenfi] do you want to submit a PR for this? Otherwise I can do it. Thanks.

> CodeFormatter could avoid creating regex object again and again
> ---
>
> Key: SPARK-25093
> URL: https://issues.apache.org/jira/browse/SPARK-25093
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Minor
>
> In class `CodeFormatter`, the method `stripExtraNewLinesAndComments` could be 
> refactored to:
> {code:scala}
> val commentReg =
>   ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" + // strip /*comment*/
>    """([ |\t]*?\/\/[\s\S]*?\n)""").r            // strip //comment
> val emptyRowsReg = """\n\s*\n""".r
> def stripExtraNewLinesAndComments(input: String): String = {
>   val codeWithoutComment = commentReg.replaceAllIn(input, "")
>   emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip extra new lines
> }
> {code}
> so that the regexes are compiled only once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25031) The schema of MapType can not be printed correctly

2018-08-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582531#comment-16582531
 ] 

Marco Gaido commented on SPARK-25031:
-

^ kindly ping [~smilegator]

> The schema of MapType can not be printed correctly
> --
>
> Key: SPARK-25031
> URL: https://issues.apache.org/jira/browse/SPARK-25031
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Hao Ren
>Priority: Minor
>  Labels: easyfix
>
> Something is wrong with the function `buildFormattedString` in `MapType`:
>  
> {code:java}
> import spark.implicits._
> case class Key(a: Int)
> case class Value(b: Int)
> Seq(
>   (1, Map(Key(1) -> Value(2))), 
>   (2, Map(Key(1) -> Value(2)))
> ).toDF("id", "dict").printSchema
> {code}
> The result is:
> {code:java}
> root
> |-- id: integer (nullable = false)
> |-- dict: map (nullable = true)
> | |-- key: struct
> | |-- value: struct (valueContainsNull = true)
> | | |-- a: integer (nullable = false)
> | | |-- b: integer (nullable = false)
> {code}
>  The expected is
> {code:java}
> root
> |-- id: integer (nullable = false)
> |-- dict: map (nullable = true)
> | |-- key: struct
> | | |-- a: integer (nullable = false)
> | |-- value: struct (valueContainsNull = true)
> | | |-- b: integer (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets

2018-08-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582383#comment-16582383
 ] 

Marco Gaido edited comment on SPARK-25125 at 8/16/18 1:07 PM:
--

I think his may be a duplicate of SPARK-24013. [~myali] may you please try and 
check whether current master still have the issue?
If


was (Author: mgaido):
I think his may be a duplicate of SPARK-25125. [~myali] may you please try and 
check whether current master still have the issue?
If

> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -
>
> Key: SPARK-25125
> URL: https://issues.apache.org/jira/browse/SPARK-25125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mir Ali
>Priority: Major
>
> The percentile_approx function in Spark SQL takes much longer than the 
> previous Hive implementation for large data sets (7B rows grouped into 200k 
> buckets, with a percentile computed per bucket). Tested with Spark 2.3.1 vs 
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1 it 
> does not finish at all in more than 2 hours. I also tried different accuracy 
> values (5000, 1000, 500); the timing does get better with smaller datasets on 
> the new version, but the speed difference is insignificant.
>  
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR  -> Spark 2.3.1
>  
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g 
> --conf spark.sql.shuffle.partitions=2000 --conf 
> spark.default.parallelism=2000 --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._ 
> import org.apache.spark.sql.types._ 
> val df=spark.range(70L).withColumn("some_grouping_id", 
> round(rand()*20L).cast(LongType)) 
> df.createOrReplaceTempView("tab")   
> val percentile_query = """ select some_grouping_id, percentile_approx(id, 
> array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 
> spark.sql(percentile_query).collect()
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets

2018-08-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582383#comment-16582383
 ] 

Marco Gaido edited comment on SPARK-25125 at 8/16/18 1:07 PM:
--

I think this may be a duplicate of SPARK-24013. [~myali] could you please try and 
check whether the current master still has the issue?


was (Author: mgaido):
I think his may be a duplicate of SPARK-24013. [~myali] may you please try and 
check whether current master still have the issue?
If

> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -
>
> Key: SPARK-25125
> URL: https://issues.apache.org/jira/browse/SPARK-25125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mir Ali
>Priority: Major
>
> The percentile_approx function in Spark SQL takes much longer than the 
> previous Hive implementation for large data sets (7B rows grouped into 200k 
> buckets, with a percentile computed per bucket). Tested with Spark 2.3.1 vs 
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1 it 
> does not finish at all in more than 2 hours. I also tried different accuracy 
> values (5000, 1000, 500); the timing does get better with smaller datasets on 
> the new version, but the speed difference is insignificant.
>  
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR  -> Spark 2.3.1
>  
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g 
> --conf spark.sql.shuffle.partitions=2000 --conf 
> spark.default.parallelism=2000 --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._ 
> import org.apache.spark.sql.types._ 
> val df=spark.range(70L).withColumn("some_grouping_id", 
> round(rand()*20L).cast(LongType)) 
> df.createOrReplaceTempView("tab")   
> val percentile_query = """ select some_grouping_id, percentile_approx(id, 
> array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 
> spark.sql(percentile_query).collect()
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets

2018-08-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582383#comment-16582383
 ] 

Marco Gaido commented on SPARK-25125:
-

I think his may be a duplicate of SPARK-25125. [~myali] may you please try and 
check whether current master still have the issue?
If

> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -
>
> Key: SPARK-25125
> URL: https://issues.apache.org/jira/browse/SPARK-25125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mir Ali
>Priority: Major
>
> The percentile_approx function in Spark SQL takes much longer than the 
> previous Hive implementation for large data sets (7B rows grouped into 200k 
> buckets, with a percentile computed per bucket). Tested with Spark 2.3.1 vs 
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1 it 
> does not finish at all in more than 2 hours. I also tried different accuracy 
> values (5000, 1000, 500); the timing does get better with smaller datasets on 
> the new version, but the speed difference is insignificant.
>  
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR  -> Spark 2.3.1
>  
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g 
> --conf spark.sql.shuffle.partitions=2000 --conf 
> spark.default.parallelism=2000 --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._ 
> import org.apache.spark.sql.types._ 
> val df=spark.range(70L).withColumn("some_grouping_id", 
> round(rand()*20L).cast(LongType)) 
> df.createOrReplaceTempView("tab")   
> val percentile_query = """ select some_grouping_id, percentile_approx(id, 
> array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 
> spark.sql(percentile_query).collect()
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array

2018-08-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581364#comment-16581364
 ] 

Marco Gaido commented on SPARK-23908:
-

[~huaxingao] they are not exposed through the Scala API, so they are not exposed 
through the other APIs either. Thanks.
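
For reference, a small sketch (Spark 2.4 syntax; the column name is made up) 
showing that the function is still reachable through SQL / {{expr}} even though 
there is no dedicated Scala DataFrame function:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("hof-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Seq(5, 6), Seq(1, 2, 3)).toDF("arr")

// Higher-order function used through a SQL expression on a DataFrame column.
df.select(expr("transform(arr, x -> x + 1)").as("plus_one")).show(truncate = false)

// Or directly in SQL.
spark.sql("SELECT transform(array(5, NULL, 6), x -> coalesce(x, 0) + 1)").show(truncate = false)
{code}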

> High-order function: transform(array, function) → array
> ---
>
> Key: SPARK-23908
> URL: https://issues.apache.org/jira/browse/SPARK-23908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array that is the result of applying function to each element of 
> array:
> {noformat}
> SELECT transform(ARRAY [], x -> x + 1); -- []
> SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7]
> SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7]
> SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', 
> 'z0']
> SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x 
> -> x IS NOT NULL)); -- [[1, 2], [3]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581075#comment-16581075
 ] 

Marco Gaido commented on LIVY-489:
--

Sure [~jerryshao], thank you. I am submitting the first PR for 2, 3, 4. Thanks.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation to its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, Spark currently exposes a JDBC interface, but it has many 
> limitations, including that all the queries are submitted to the same 
> application, so there is no isolation/security. Livy can offer both, making a 
> Livy JDBC API a better solution for companies/users who want to use Spark to 
> run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SPARK-25123) SimpleExprValue may cause the loss of a reference

2018-08-15 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-25123:
---

 Summary: SimpleExprValue may cause the loss of a reference
 Key: SPARK-25123
 URL: https://issues.apache.org/jira/browse/SPARK-25123
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Marco Gaido


While introducing the new JavaCode abstraction, in order to enable tracking 
references and allowing transformations, we added 3 types of expression values: 
global variables, local variables and simple expressions.

While checking whether we could use this new abstraction to fix an issue 
reported in another JIRA, I realized that SimpleExprValue contains a string with 
the generated code, but this code can actually reference other variables. Since 
the value carried in SimpleExprValue is a string, though, we were losing track 
of those variable references.

So this JIRA is about using a Block to represent the Java code carried by 
SimpleExprValue, so that we don't lose references.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579514#comment-16579514
 ] 

Marco Gaido commented on SPARK-25051:
-

This was caused by the introduction of AnalysisBarrier. I will submit a PR for 
branch 2.3. On 2.4+ (current master) we no longer have this issue because 
AnalysisBarrier was removed. Anyway, this raises a question: shall we remove 
AnalysisBarrier from the 2.3 line too? In the current situation, backporting any 
analyzer fix to 2.3 is going to be painful.
cc [~rxin] [~cloud_fan]
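
Until the backport lands, a possible workaround sketch (not from this ticket, 
and assuming the intent of the reported query is an anti-join, i.e. "rows of df1 
with no matching id in df2"):

{code:scala}
// 1) Express the intent directly as a left_anti join:
val anti = df1.join(df2, Seq("id"), "left_anti")

// 2) Or keep both id columns by using an explicit join condition,
//    so that df2("id") stays resolvable in the filter:
val outer = df1.join(df2, df1("id") === df2("id"), "left_outer")
  .where(df2("id").isNull)
{code}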

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579444#comment-16579444
 ] 

Marco Gaido commented on SPARK-25051:
-

cc [~jerryshao] shall we set it as a blocker for 2.3.2?

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25051:

Labels: correctness  (was: )

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578447#comment-16578447
 ] 

Marco Gaido commented on SPARK-24928:
-

Actually this is a duplicate of SPARK-11982, which solved the issue for the SQL 
API. For the RDD API, please be careful choosing the right side of the 
cartesian. I am closing this as a duplicate. Feel free to reopen if you think 
anything else can be done.
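
For the RDD API, a sketch of what "choosing the right side" means in practice 
(RDD names are placeholders, not from the report):

{code:scala}
// In rdd1.cartesian(rdd2), rdd1 drives the outer loop and rdd2's partition
// iterator is re-created for every element of rdd1, so putting the small
// dataset on the outer (left) side avoids re-reading the other input many times.
val fast = smallRdd.cartesian(largeRdd)
val slow = largeRdd.cartesian(smallRdd)   // rebuilds smallRdd's iterator per element

// If the pair order matters, swap it back afterwards:
val sameOrderAsSlow = fast.map { case (s, l) => (l, s) }
{code}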

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-24928.
-
Resolution: Duplicate

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578348#comment-16578348
 ] 

Marco Gaido commented on SPARK-25094:
-

[~igreenfi] as I mentioned to you, this is a known issue. You found a TODO 
because currently it is not possible to implement that TODO. There is an ongoing 
effort to make it happen, but it is a huge effort, so it will take time. Thanks.

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578237#comment-16578237
 ] 

Marco Gaido commented on LIVY-489:
--

[~jerryshao] I created 5 subtasks for this. Hope they are reasonable to you. 
Thanks.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-495) Add basic UI for thriftserver

2018-08-13 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-495:


 Summary: Add basic UI for thriftserver
 Key: LIVY-495
 URL: https://issues.apache.org/jira/browse/LIVY-495
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


The issue tracks the implementation of a UI showing basic information about the 
status of the Livy thriftserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-494) Add thriftserver to Livy server

2018-08-13 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-494:


 Summary: Add thriftserver to Livy server
 Key: LIVY-494
 URL: https://issues.apache.org/jira/browse/LIVY-494
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


Including the thriftserver in the Livy server. This means starting the 
Thriftserver at Livy server startup and adding the needed script in order to 
interact with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (LIVY-493) Add UTs to the thriftserver module

2018-08-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated LIVY-493:
-
Description: Tracks the implementation and addition of UT for the new Livy 
thriftserver.

> Add UTs to the thriftserver module
> --
>
> Key: LIVY-493
> URL: https://issues.apache.org/jira/browse/LIVY-493
> Project: Livy
>  Issue Type: Sub-task
>Reporter: Marco Gaido
>Priority: Major
>
> Tracks the implementation and addition of UT for the new Livy thriftserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-493) Add UTs to the thriftserver module

2018-08-13 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-493:


 Summary: Add UTs to the thriftserver module
 Key: LIVY-493
 URL: https://issues.apache.org/jira/browse/LIVY-493
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-492) Base implementation Livy thriftserver

2018-08-13 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-492:


 Summary: Base implementation Livy thriftserver
 Key: LIVY-492
 URL: https://issues.apache.org/jira/browse/LIVY-492
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


The issue tracks the landing of the initial implementation of the Livy 
thriftserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (LIVY-490) Add thriftserver module

2018-08-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido closed LIVY-490.

Resolution: Duplicate

> Add thriftserver module
> ---
>
> Key: LIVY-490
> URL: https://issues.apache.org/jira/browse/LIVY-490
> Project: Livy
>  Issue Type: Sub-task
>Reporter: Marco Gaido
>Priority: Major
>
> Add a new module for the Thriftserver implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-491) Add thriftserver module

2018-08-13 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-491:


 Summary: Add thriftserver module
 Key: LIVY-491
 URL: https://issues.apache.org/jira/browse/LIVY-491
 Project: Livy
  Issue Type: Sub-task
  Components: Server
Affects Versions: 0.6.0
Reporter: Marco Gaido


Add a new module for the implementation of the Livy thriftserver.
This includes adding the base thriftserver implementation from Hive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-490) Add thriftserver module

2018-08-13 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-490:


 Summary: Add thriftserver module
 Key: LIVY-490
 URL: https://issues.apache.org/jira/browse/LIVY-490
 Project: Livy
  Issue Type: Sub-task
Reporter: Marco Gaido


Add a new module for the Thriftserver implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578226#comment-16578226
 ] 

Marco Gaido commented on LIVY-489:
--

Sure [~jerryshao], the branch is 
https://github.com/mgaido91/incubator-livy/tree/livy_thrift, and the diff is 
https://github.com/apache/incubator-livy/compare/master...mgaido91:livy_thrift.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578038#comment-16578038
 ] 

Marco Gaido commented on SPARK-25093:
-

I just marked this as a minor priority ticket, anyway I agree with the proposed 
improvement. Are you submitting a PR for it? Thanks.

> CodeFormatter could avoid creating regex object again and again
> ---
>
> Key: SPARK-25093
> URL: https://issues.apache.org/jira/browse/SPARK-25093
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Minor
>
> in class `CodeFormatter` 
> method: `stripExtraNewLinesAndComments`
> could be refactored to: 
> {code:scala}
> // Some comments here
>  val commentReg =
> ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +// strip /*comment*/
>   """([ |\t]*?\/\/[\s\S]*?\n)""").r  // strip //comment
>   val emptyRowsReg = """\n\s*\n""".r
> def stripExtraNewLinesAndComments(input: String): String = {
> val codeWithoutComment = commentReg.replaceAllIn(input, "")
> emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip ExtraNewLines
>   }
> {code}
> so the Regex would be compiled only once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25093:

Priority: Minor  (was: Major)

> CodeFormatter could avoid creating regex object again and again
> ---
>
> Key: SPARK-25093
> URL: https://issues.apache.org/jira/browse/SPARK-25093
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Minor
>
> in class `CodeFormatter` 
> method: `stripExtraNewLinesAndComments`
> could be refactored to: 
> {code:scala}
> // Some comments here
>  val commentReg =
> ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +// strip /*comment*/
>   """([ |\t]*?\/\/[\s\S]*?\n)""").r  // strip //comment
>   val emptyRowsReg = """\n\s*\n""".r
> def stripExtraNewLinesAndComments(input: String): String = {
> val codeWithoutComment = commentReg.replaceAllIn(input, "")
> emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip ExtraNewLines
>   }
> {code}
> so the Regex would be compiled only once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577995#comment-16577995
 ] 

Marco Gaido commented on LIVY-489:
--

Hi [~jerryshao]. Thanks for your comment. Unfortunately I am not sure how to 
split the implementation, as most of the code is required for it to work. As of 
now, the only two tasks I have been able to split it into are:
 - Initial Thriftserver implementation;
 - Adding a Thriftserver UI.
I will keep thinking about this anyway. Any suggestion is welcome. Thanks.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-12 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577593#comment-16577593
 ] 

Marco Gaido commented on SPARK-25094:
-

This is a duplicate of many other tickets. Unfortunately this problem has not 
yet been solved, so in this case whole-stage code generation is disabled for the 
query. There is an ongoing effort to fix this issue in the future, though.
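
As a stop-gap (not a fix, and not from this ticket), whole-stage codegen can 
also be turned off explicitly so the oversized-method compilation attempt and 
the warning are avoided for the whole session; a minimal sketch:

{code:scala}
// Stop-gap only: disables whole-stage code generation for the session, trading
// some performance for skipping the 64KB method-size compilation failure.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}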

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-09 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated LIVY-489:
-
Description: 
Many users and BI tools use JDBC connections in order to retrieve data. As Livy 
exposes only a REST API, this is a limitation in its adoption. Hence, adding a 
JDBC endpoint may be a very useful feature, which could also make Livy a more 
attractive solution for end users to adopt.

Moreover, currently, Spark exposes a JDBC interface, but this has many 
limitations, including that all the queries are submitted to the same 
application, therefore there is no isolation/security, which can be offered by 
Livy, making a Livy JDBC API a better solution for companies/users who want to 
use Spark in order to run their queries through JDBC.

In order to make the transition from existing solutions to the new JDBC server 
seamless, the proposal is to use the Hive thrift-server and extend it as it was 
done by the STS.

[Here, you can find the design 
doc.|https://drive.google.com/file/d/10r8aF1xmL2MTtuREawGcrJobMf5Abtts/view?usp=sharing]
 

  was:
Many users and BI tools use JDBC connections in order to retrieve data. As Livy 
exposes only a REST API, this is a limitation in its adoption. Hence, adding a 
JDBC endpoint may be a very useful feature, which could also make Livy a more 
attractive solution for end users to adopt.

Moreover, currently, Spark exposes a JDBC interface, but this has many 
limitations, including that all the queries are submitted to the same 
application, therefore there is no isolation/security, which can be offered by 
Livy, making a Livy JDBC API a better solution for companies/users who want to 
use Spark in order to run their queries through JDBC.

In order to make the transition from existing solutions to the new JDBC server 
seamless, the proposal is to use the Hive thrift-server and extend it as it was 
done by the STS.

[Here, you can find the design 
doc.|https://docs.google.com/a/hortonworks.com/document/d/e/2PACX-1vS-ffJwXJ5nZluV-81AJ4WvS3SFX_KcZ0Djz9QGeEtLullYdLHT8dJvuwPpLBT2s3EU4CO6ij14wVcv/pub]
 


> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://drive.google.com/file/d/10r8aF1xmL2MTtuREawGcrJobMf5Abtts/view?usp=sharing]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-09 Thread Marco Gaido (JIRA)
Marco Gaido created LIVY-489:


 Summary: Expose a JDBC endpoint for Livy
 Key: LIVY-489
 URL: https://issues.apache.org/jira/browse/LIVY-489
 Project: Livy
  Issue Type: New Feature
  Components: API, Server
Affects Versions: 0.6.0
Reporter: Marco Gaido


Many users and BI tools use JDBC connections in order to retrieve data. As Livy 
exposes only a REST API, this is a limitation in its adoption. Hence, adding a 
JDBC endpoint may be a very useful feature, which could also make Livy a more 
attractive solution for end users to adopt.

Moreover, currently, Spark exposes a JDBC interface, but this has many 
limitations, including that all the queries are submitted to the same 
application, therefore there is no isolation/security, which can be offered by 
Livy, making a Livy JDBC API a better solution for companies/users who want to 
use Spark in order to run their queries through JDBC.

In order to make the transition from existing solutions to the new JDBC server 
seamless, the proposal is to use the Hive thrift-server and extend it as it was 
done by the STS.

[Here, you can find the design 
doc.|https://docs.google.com/a/hortonworks.com/document/d/e/2PACX-1vS-ffJwXJ5nZluV-81AJ4WvS3SFX_KcZ0Djz9QGeEtLullYdLHT8dJvuwPpLBT2s3EU4CO6ij14wVcv/pub]
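
Since the proposal is to reuse and extend the Hive thrift-server, an existing 
HiveServer2 JDBC client should in principle work unchanged once the endpoint 
exists. A hypothetical sketch (the endpoint is not implemented yet; host, port, 
credentials and table name are placeholders):

{code:scala}
import java.sql.DriverManager

// Hypothetical client: livy-host, 10090, "user" and some_table are placeholders.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://livy-host:10090/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT count(*) FROM some_table")
while (rs.next()) println(rs.getLong(1))
conn.close()
{code}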
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25031) The schema of MapType can not be printed correctly

2018-08-08 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573299#comment-16573299
 ] 

Marco Gaido commented on SPARK-25031:
-

[~smilegator] shall this be resolved, since 
https://github.com/apache/spark/pull/22006 was merged? Thanks.

> The schema of MapType can not be printed correctly
> --
>
> Key: SPARK-25031
> URL: https://issues.apache.org/jira/browse/SPARK-25031
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Hao Ren
>Priority: Minor
>  Labels: easyfix
>
> Something wrong with the function `buildFormattedString` in `MapType`
>  
> {code:java}
> import spark.implicits._
> case class Key(a: Int)
> case class Value(b: Int)
> Seq(
>   (1, Map(Key(1) -> Value(2))), 
>   (2, Map(Key(1) -> Value(2)))
> ).toDF("id", "dict").printSchema
> {code}
> The result is:
> {code:java}
> root
> |-- id: integer (nullable = false)
> |-- dict: map (nullable = true)
> | |-- key: struct
> | |-- value: struct (valueContainsNull = true)
> | | |-- a: integer (nullable = false)
> | | |-- b: integer (nullable = false)
> {code}
>  The expected is
> {code:java}
> root
> |-- id: integer (nullable = false)
> |-- dict: map (nullable = true)
> | |-- key: struct
> | | |-- a: integer (nullable = false)
> | |-- value: struct (valueContainsNull = true)
> | | |-- b: integer (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25042) Flaky test: org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic

2018-08-07 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-25042:
---

 Summary: Flaky test: 
org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic
 Key: SPARK-25042
 URL: https://issues.apache.org/jira/browse/SPARK-25042
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Marco Gaido


The test {{compacted topic}} in 
{{org.apache.spark.streaming.kafka010.KafkaRDDSuite}} is flaky: it failed in an 
unrelated PR: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94293/testReport/.
 And it passes locally on the same branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24928) spark sql cross join running time too long

2018-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289
 ] 

Marco Gaido edited comment on SPARK-24928 at 8/6/18 4:13 PM:
-

[~matthewnormyle] the fix you are proposing doesn't solve the problem; it 
returns a wrong result. The root cause of the issue here is that the 
for-comprehension is a nested loop, so if the outer iterator is the small one we 
build far fewer iterators than otherwise. I think that in the RDD case there is 
little we can do, while for the SQL case we can probably add an optimizer rule 
using the statistics (if they are available).

PS: I will soon submit a PR with the optimizer rule to use the best side to 
build the nested loop if we have the stats. I don't think we can do anything 
else. Thanks.


was (Author: mgaido):
[~matthewnormyle] the fix you are proposing doesn't solve the problem; it 
returns a wrong result. The root cause of the issue here is that the 
for-comprehension is a nested loop, so if the outer iterator is the small one we 
build far fewer iterators than otherwise. I think that in the RDD case there is 
little we can do, while for the SQL case we can probably add an optimizer rule 
using the statistics (if they are available).

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24928) spark sql cross join running time too long

2018-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289
 ] 

Marco Gaido edited comment on SPARK-24928 at 8/6/18 2:45 PM:
-

[~matthewnormyle] the fix you are proposing doesn't solve the problem; it 
returns a wrong result. The root cause of the issue here is that the 
for-comprehension is a nested loop, so if the outer iterator is the small one we 
build far fewer iterators than otherwise. I think that in the RDD case there is 
little we can do, while for the SQL case we can probably add an optimizer rule 
using the statistics (if they are available).


was (Author: mgaido):
[~matthewnormyle] the fix you are proposing doesn't solve the problem; it 
returns a wrong result. The root cause of the issue here is that the 
for-comprehension is a nested loop, so if the outer iterator is the small one we 
build far fewer iterators than otherwise. I think that in the RDD case there is 
little we can do, while for the SQL case we can probably add an optimizer rule 
using the statistics (if they are computed).

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289
 ] 

Marco Gaido commented on SPARK-24928:
-

[~matthewnormyle] the fix you are proposing doesn't solve the problem; it 
returns a wrong result. The root cause of the issue here is that the 
for-comprehension is a nested loop, so if the outer iterator is the small one we 
build far fewer iterators than otherwise. I think that in the RDD case there is 
little we can do, while for the SQL case we can probably add an optimizer rule 
using the statistics (if they are computed).
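
A self-contained sketch of the nested-loop shape (not Spark's actual code; 
readRight() stands for re-materializing the right-hand partition):

{code:scala}
// The inner iterator is rebuilt once per element of the outer side, so a large
// outer side multiplies the cost of re-reading the inner input.
def cartesian[A, B](left: Iterator[A], readRight: () => Iterator[B]): Iterator[(A, B)] =
  for (x <- left; y <- readRight()) yield (x, y)
{code}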

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25012) dataframe creation results in matcherror

2018-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570022#comment-16570022
 ] 

Marco Gaido commented on SPARK-25012:
-

[~simm] you're right that the error message doesn't help, and indeed it was 
fixed in SPARK-24366. So if you try the current master branch (or the upcoming 
2.4 release when it is out), you should get a more meaningful error message 
which may help your debugging. I am not sure about the root cause of the 
"random" behavior of your test cases, but I think it is caused by some misuse in 
your code.
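
One common source of such a mismatch (a guess, not confirmed here): 
PrunedFilteredScan.buildScan receives requiredColumns and is expected to return 
Rows containing exactly those columns, in that order, while the relation in the 
report always emits all five fields in schema order. A sketch of honoring the 
pruning in the report's LogRelation, reusing its example data:

{code:scala}
// Hypothetical sketch: emit only the requested columns, in the requested order.
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val ts = Timestamp.valueOf(
    LocalDateTime.parse("2017-02-09T00:09:27", DateTimeFormatter.ISO_LOCAL_DATE_TIME))
  val fullRecord: Map[String, Any] = Map(
    "application" -> "app", "dateTime" -> ts, "component" -> "comp",
    "level" -> "level", "message" -> "mess")
  val row = Row.fromSeq(requiredColumns.map(fullRecord))
  sqlContext.sparkContext.parallelize(List(row, row))
}
{code}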

> dataframe creation results in matcherror
> 
>
> Key: SPARK-25012
> URL: https://issues.apache.org/jira/browse/SPARK-25012
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
> Environment: spark 2.3.1
> mac
> scala 2.11.12
>  
>Reporter: uwe
>Priority: Major
>
> hi,
>  
> running the attached code results in a 
>  
> {code:java}
> scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
> {code}
>  # i do think this is wrong (at least i do not see the issue in my code)
>  # the error is the ein 90% of the cases (it sometimes passes). that makes me 
> think something weird is going on
>  
>  
> {code:java}
> package misc
> import java.sql.Timestamp
> import java.time.LocalDateTime
> import java.time.format.DateTimeFormatter
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.sources._
> import org.apache.spark.sql.types.{StringType, StructField, StructType, 
> TimestampType}
> import org.apache.spark.sql.{Row, SQLContext, SparkSession}
> case class LogRecord(application:String, dateTime: Timestamp, component: 
> String, level: String, message: String)
> class LogRelation(val sqlContext: SQLContext, val path: String) extends 
> BaseRelation with PrunedFilteredScan {
>  override def schema: StructType = StructType(Seq(
>  StructField("application", StringType, false),
>  StructField("dateTime", TimestampType, false),
>  StructField("component", StringType, false),
>  StructField("level", StringType, false),
>  StructField("message", StringType, false)))
>  override def buildScan(requiredColumns: Array[String], filters: 
> Array[Filter]): RDD[Row] = {
>  val str = "2017-02-09T00:09:27"
>  val ts =Timestamp.valueOf(LocalDateTime.parse(str, 
> DateTimeFormatter.ISO_LOCAL_DATE_TIME))
>  val 
> data=List(Row("app",ts,"comp","level","mess"),Row("app",ts,"comp","level","mess"))
>  sqlContext.sparkContext.parallelize(data)
>  }
> }
> class LogDataSource extends DataSourceRegister with RelationProvider {
>  override def shortName(): String = "log"
>  override def createRelation(sqlContext: SQLContext, parameters: Map[String, 
> String]): BaseRelation =
>  new LogRelation(sqlContext, parameters("path"))
> }
> object f0 extends App {
>  lazy val spark: SparkSession = 
> SparkSession.builder().master("local").appName("spark session").getOrCreate()
>  val df = spark.read.format("log").load("hdfs:///logs")
>  df.show()
> }
>  
> {code}
>  
> results in the following stacktrace
>  
> {noformat}
> 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - 
> Task 0 in stage 0.0 failed 1 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): 
> scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
>  at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
>  at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  at 
> 

[jira] [Commented] (SPARK-25012) dataframe creation results in matcherror

2018-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569960#comment-16569960
 ] 

Marco Gaido commented on SPARK-25012:
-

Seems the same as SPARK-24366. In any case, it looks like a problem in your 
schema definition/column mappings.

> dataframe creation results in matcherror
> 
>
> Key: SPARK-25012
> URL: https://issues.apache.org/jira/browse/SPARK-25012
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
> Environment: spark 2.3.1
> mac
> scala 2.11.12
>  
>Reporter: uwe
>Priority: Major
>
> hi,
>  
> running the attached code results in a 
>  
> {code:java}
> scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
> {code}
>  # i do think this is wrong (at least i do not see the issue in my code)
>  # the error is the ein 90% of the cases (it sometimes passes). that makes me 
> think something weird is going on
>  
>  
> {code:java}
> package misc
> import java.sql.Timestamp
> import java.time.LocalDateTime
> import java.time.format.DateTimeFormatter
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.sources._
> import org.apache.spark.sql.types.{StringType, StructField, StructType, 
> TimestampType}
> import org.apache.spark.sql.{Row, SQLContext, SparkSession}
> case class LogRecord(application:String, dateTime: Timestamp, component: 
> String, level: String, message: String)
> class LogRelation(val sqlContext: SQLContext, val path: String) extends 
> BaseRelation with PrunedFilteredScan {
>  override def schema: StructType = StructType(Seq(
>  StructField("application", StringType, false),
>  StructField("dateTime", TimestampType, false),
>  StructField("component", StringType, false),
>  StructField("level", StringType, false),
>  StructField("message", StringType, false)))
>  override def buildScan(requiredColumns: Array[String], filters: 
> Array[Filter]): RDD[Row] = {
>  val str = "2017-02-09T00:09:27"
>  val ts =Timestamp.valueOf(LocalDateTime.parse(str, 
> DateTimeFormatter.ISO_LOCAL_DATE_TIME))
>  val 
> data=List(Row("app",ts,"comp","level","mess"),Row("app",ts,"comp","level","mess"))
>  sqlContext.sparkContext.parallelize(data)
>  }
> }
> class LogDataSource extends DataSourceRegister with RelationProvider {
>  override def shortName(): String = "log"
>  override def createRelation(sqlContext: SQLContext, parameters: Map[String, 
> String]): BaseRelation =
>  new LogRelation(sqlContext, parameters("path"))
> }
> object f0 extends App {
>  lazy val spark: SparkSession = 
> SparkSession.builder().master("local").appName("spark session").getOrCreate()
>  val df = spark.read.format("log").load("hdfs:///logs")
>  df.show()
> }
>  
> {code}
>  
> results in the following stacktrace
>  
> {noformat}
> 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - 
> Task 0 in stage 0.0 failed 1 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): 
> scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
>  at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
>  at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  at 

[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP

2018-08-03 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568152#comment-16568152
 ] 

Marco Gaido commented on SPARK-23937:
-

I am working on this, thanks.

> High-order function: map_filter(map, function) → MAP
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24598) SPARK SQL:Datatype overflow conditions gives incorrect result

2018-08-03 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568005#comment-16568005
 ] 

Marco Gaido commented on SPARK-24598:
-

[~smilegator] since we just enhanced the doc but have not really addressed the 
overflow condition itself, which I think we are targeting to fix in 3.0, shall 
we leave this open for now and resolve it once the actual fix is in place? What 
do you think? Thanks.

> SPARK SQL:Datatype overflow conditions gives incorrect result
> -
>
> Key: SPARK-24598
> URL: https://issues.apache.org/jira/browse/SPARK-24598
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: navya
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Execute an sql query, so that it results in overflow conditions. 
> EX - SELECT 9223372036854775807 + 1 result = -9223372036854776000
>  
> Expected result - Error should be throw like mysql. 
> mysql> SELECT 9223372036854775807 + 1;
> ERROR 1690 (22003): BIGINT value is out of range in '(9223372036854775807 + 
> 1)'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-31 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563259#comment-16563259
 ] 

Marco Gaido commented on SPARK-24975:
-

This seems a duplicate of SPARK-24188. However, 2.3.1 is listed here as affected, 
which should not be the case according to SPARK-24188. Could you please check 
whether 2.3.1 is actually affected and, if not, close this as a duplicate? 
Thanks.

> Spark history server REST API /api/v1/version returns error 404
> ---
>
> Key: SPARK-24975
> URL: https://issues.apache.org/jira/browse/SPARK-24975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: shanyu zhao
>Priority: Major
>
> Spark history server REST API provides /api/v1/version, according to doc:
> [https://spark.apache.org/docs/latest/monitoring.html]
> However, for Spark 2.3, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> 
> 
> 
> Error 404 Not Found
> 
> HTTP ERROR 404
> Problem accessing /api/v1/version. Reason:
>  Not Foundhttp://eclipse.org/jetty;>Powered by 
> Jetty:// 9.3.z-SNAPSHOT
> 
> {code}
> On a Spark 2.2 cluster, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> {
> "spark" : "2.2.0"
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24944) SparkUi build problem

2018-07-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561587#comment-16561587
 ] 

Marco Gaido commented on SPARK-24944:
-

Can you close this JIRA as invalid? Thanks.

> SparkUi build problem
> -
>
> Key: SPARK-24944
> URL: https://issues.apache.org/jira/browse/SPARK-24944
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0, 2.3.1
> Environment: scala 2.11.8
> java version "1.8.0_181" 
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13) 
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> 
> Gradle 4.5.1
> 
> Build time: 2018-02-05 13:22:49 UTC
> Revision: 37007e1c012001ff09973e0bd095139239ecd3b3
> Groovy: 2.4.12
> Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017
> JVM: 1.8.0_181 (Oracle Corporation 25.181-b13)
> OS: Windows 7 6.1 amd64
>  
> build.gradle:
> group 'it.build-test.spark'
> version '1.0-SNAPSHOT'
> apply plugin: 'java'
> apply plugin: 'scala'
> sourceCompatibility = 1.8
> repositories {
>  mavenCentral()
> }
> dependencies {
>  compile 'org.apache.spark:spark-core_2.11:2.3.1'
>  compile 'org.scala-lang:scala-library:2.11.8'
> }
> tasks.withType(ScalaCompile) {
>  scalaCompileOptions.additionalParameters = ["-Ylog-classpath"]
> }
>Reporter: Fabio
>Priority: Minor
>  Labels: UI, WebUI, build
> Attachments: build-test.zip
>
>
> Hi. I'm trying to customize SparkUi with my business logic. Trying to access 
> the UI, I get a build problem. It's enough to create this class:
> _package org.apache.spark_
> _import org.apache.spark.ui.SparkUI_
> _case class SparkContextUtils(sc: SparkContext) {_
>  _def ui: Option[SparkUI] = sc.ui_
> _}_
>  
> to have this error:
>  
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term eclipse in package org,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org._
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term jetty in value org.eclipse,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org.eclipse._
> _two errors found_
> _:compileScala FAILED_
> _FAILURE: Build failed with an exception._
> _* What went wrong:_
> _Execution failed for task ':compileScala'._
> _> Compilation failed_
> _* Try:_
> _Run with --stacktrace option to get the stack trace. Run with --info or 
> --debug option to get more log output. Run with --scan to get full insights._
> _* Get more help at https://help.gradle.org_
> _BUILD FAILED in 26s_
> _1 actionable task: 1 executed_
> _Compilation failed_
>  
> The option "-Ylog-classpath" hasn't any useful information
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561077#comment-16561077
 ] 

Marco Gaido commented on SPARK-24957:
-

I am not sure what you mean by "When codegen is disabled all results are 
correct.": I checked and was able to reproduce the issue both with codegen 
enabled and with codegen disabled.

cc [~jerryshao] this doesn't seem a regression to me, but it is a pretty serious 
bug; I am not sure whether we should include it in the next 2.3 version.
cc [~smilegator] [~cloud_fan] I think we should consider this a blocker for 
2.4. What do you think? Thanks.
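
For anyone double-checking, this is roughly how I re-ran the repro with
whole-stage codegen turned off ({{df_grouped_2}} as defined in the description
below):

{code:scala}
// Disable whole-stage codegen in the current session and re-run the aggregation.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
df_grouped_2.collect()
{code}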

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down on a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-07-27 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24948:
---

 Summary: SHS filters wrongly some applications due to permission 
check
 Key: SPARK-24948
 URL: https://issues.apache.org/jira/browse/SPARK-24948
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Marco Gaido


SHS filters out the event logs it doesn't have permissions to read. Unfortunately, 
this check is quite naive, as it takes into account only the base permissions 
(ie. user, group, other permissions). For instance, if ACLs are enabled, they 
are ignored in this check; moreover, each filesystem may have different 
policies (eg. they can consider spark a superuser who can access everything).

This results in some applications not being displayed in the SHS, even though 
the Spark user (or whatever user the SHS is started with) can actually read 
their event logs.
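
For illustration only, a minimal sketch (against the Hadoop FileSystem API; not
the actual SHS code) contrasting the naive permission-bit check with asking the
filesystem itself, which also honours ACLs and filesystem-specific policies:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.AccessControlException

// Re-implementation of the naive owner/group/other check described above.
def canReadViaBasicPermissions(fs: FileSystem, path: Path,
                               user: String, groups: Set[String]): Boolean = {
  val status = fs.getFileStatus(path)
  val perm = status.getPermission
  if (status.getOwner == user) perm.getUserAction.implies(FsAction.READ)
  else if (groups.contains(status.getGroup)) perm.getGroupAction.implies(FsAction.READ)
  else perm.getOtherAction.implies(FsAction.READ)
}

// Delegating the decision to the filesystem instead takes ACLs and
// FS-specific policies into account.
def canReadViaFileSystem(fs: FileSystem, path: Path): Boolean =
  try { fs.access(path, FsAction.READ); true }
  catch { case _: AccessControlException => false }
{code}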





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24944) SparkUi build problem

2018-07-27 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559790#comment-16559790
 ] 

Marco Gaido commented on SPARK-24944:
-

This seems more like a problem in your project and its dependencies than an issue 
in Spark. This - rather than a JIRA - should have been a question sent to the 
mailing list.

> SparkUi build problem
> -
>
> Key: SPARK-24944
> URL: https://issues.apache.org/jira/browse/SPARK-24944
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0, 2.3.1
> Environment: scala 2.11.8
> java version "1.8.0_181" 
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13) 
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> 
> Gradle 4.5.1
> 
> Build time: 2018-02-05 13:22:49 UTC
> Revision: 37007e1c012001ff09973e0bd095139239ecd3b3
> Groovy: 2.4.12
> Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017
> JVM: 1.8.0_181 (Oracle Corporation 25.181-b13)
> OS: Windows 7 6.1 amd64
>  
> build.gradle:
> group 'it.build-test.spark'
> version '1.0-SNAPSHOT'
> apply plugin: 'java'
> apply plugin: 'scala'
> sourceCompatibility = 1.8
> repositories {
>  mavenCentral()
> }
> dependencies {
>  compile 'org.apache.spark:spark-core_2.11:2.3.1'
>  compile 'org.scala-lang:scala-library:2.11.8'
> }
> tasks.withType(ScalaCompile) {
>  scalaCompileOptions.additionalParameters = ["-Ylog-classpath"]
> }
>Reporter: Fabio
>Priority: Major
>  Labels: UI, WebUI, build
> Attachments: build-test.zip
>
>
> Hi. I'm trying to customize SparkUi with my business logic. Trying to access 
> the UI, I get a build problem. It's enough to create this class:
> _package org.apache.spark_
> _import org.apache.spark.ui.SparkUI_
> _case class SparkContextUtils(sc: SparkContext) {_
>  _def ui: Option[SparkUI] = sc.ui_
> _}_
>  
> to have this error:
>  
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term eclipse in package org,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org._
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term jetty in value org.eclipse,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org.eclipse._
> _two errors found_
> _:compileScala FAILED_
> _FAILURE: Build failed with an exception._
> _* What went wrong:_
> _Execution failed for task ':compileScala'._
> _> Compilation failed_
> _* Try:_
> _Run with --stacktrace option to get the stack trace. Run with --info or 
> --debug option to get more log output. Run with --scan to get full insights._
> _* Get more help at https://help.gradle.org_
> _BUILD FAILED in 26s_
> _1 actionable task: 1 executed_
> _Compilation failed_
>  
> The option "-Ylog-classpath" doesn't give any useful information
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-07-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558287#comment-16558287
 ] 

Marco Gaido commented on SPARK-24928:
-

The affected version is pretty old, can you check a newer version?

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> Spark SQL running time is too long even though the input left and right 
> tables are small HDFS text-format data.
> the sql is:  select * from t1 cross join t2  
> t1 has 49 lines and three columns
> t2 has 1 line and one column only
> it runs for more than 30 mins and then fails
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  // 1 line, 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  // 49 
> lines, 3 columns
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> it runs for more than 5 mins, while using CartesianRDD(sc, ones, twos) it only 
> takes less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data

2018-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555652#comment-16555652
 ] 

Marco Gaido edited comment on SPARK-24904 at 7/25/18 1:28 PM:
--

I see now what you mean, but yes, I think there is an assumption you are making 
which is not always true, ie. "The output is (expected to be) very small 
compared to the big table". That is not always the case: if all the rows from 
the big table match the small one, it doesn't hold. We may try to do something 
like what you mentioned in the optimizer if CBO is enabled and we have good 
enough statistics about the output size of the inner join, but I am not sure.


was (Author: mgaido):
I see now what you mean, but yes, It think there is an assumption you are doing 
which is not always true, ie. "The output is (expected to be) very small 
compared to the big table". That is not true. If all the rows from the big 
table match the small one, this is not the case. We may trying to do something 
like what you mentioned in the optimizer if CBO is enabled and we have good 
enough statistics about the output size of the inner join, but i am not sure.

> Join with broadcasted dataframe causes shuffle of redundant data
> 
>
> Key: SPARK-24904
> URL: https://issues.apache.org/jira/browse/SPARK-24904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.2
>Reporter: Shay Elbaz
>Priority: Minor
>
> When joining a "large" dataframe with broadcasted small one, and join-type is 
> on the small DF side (see right-join below), the physical plan falls back to 
> sort merge join. But when the join is on the large DF side, the broadcast 
> does take place. Is there a good reason for this? In the below example it 
> sure doesn't make any sense to shuffle the entire large table:
>  
> {code:java}
> val small = spark.range(1, 10)
> val big = spark.range(1, 1 << 30)
>   .withColumnRenamed("id", "id2")
> big.join(broadcast(small), $"id" === $"id2", "right")
> .explain
> //OUTPUT:
> == Physical Plan == 
> SortMergeJoin [id2#16307L], [id#16310L], RightOuter 
> :- *Sort [id2#16307L ASC NULLS FIRST], false, 0
>  :  +- Exchange hashpartitioning(id2#16307L, 1000)
>  : +- *Project [id#16304L AS id2#16307L]
>  :    +- *Range (1, 1073741824, step=1, splits=Some(600))
>  +- *Sort [id#16310L ASC NULLS FIRST], false, 0
>     +- Exchange hashpartitioning(id#16310L, 1000)
>    +- *Range (1, 10, step=1, splits=Some(600))
> {code}
> As a workaround, users need to perform inner instead of right join, and then 
> join the result back with the small DF to fill the missing rows.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data

2018-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555842#comment-16555842
 ] 

Marco Gaido commented on SPARK-24904:
-

[~shay_elbaz] In the case I mentioned before, the approach you proposed is not 
better; it is worse, as it requires an unneeded additional broadcast join.

> Join with broadcasted dataframe causes shuffle of redundant data
> 
>
> Key: SPARK-24904
> URL: https://issues.apache.org/jira/browse/SPARK-24904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.2
>Reporter: Shay Elbaz
>Priority: Minor
>
> When joining a "large" dataframe with broadcasted small one, and join-type is 
> on the small DF side (see right-join below), the physical plan falls back to 
> sort merge join. But when the join is on the large DF side, the broadcast 
> does take place. Is there a good reason for this? In the below example it 
> sure doesn't make any sense to shuffle the entire large table:
>  
> {code:java}
> val small = spark.range(1, 10)
> val big = spark.range(1, 1 << 30)
>   .withColumnRenamed("id", "id2")
> big.join(broadcast(small), $"id" === $"id2", "right")
> .explain
> //OUTPUT:
> == Physical Plan == 
> SortMergeJoin [id2#16307L], [id#16310L], RightOuter 
> :- *Sort [id2#16307L ASC NULLS FIRST], false, 0
>  :  +- Exchange hashpartitioning(id2#16307L, 1000)
>  : +- *Project [id#16304L AS id2#16307L]
>  :    +- *Range (1, 1073741824, step=1, splits=Some(600))
>  +- *Sort [id#16310L ASC NULLS FIRST], false, 0
>     +- Exchange hashpartitioning(id#16310L, 1000)
>    +- *Range (1, 10, step=1, splits=Some(600))
> {code}
> As a workaround, users need to perform inner instead of right join, and then 
> join the result back with the small DF to fill the missing rows.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data

2018-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555652#comment-16555652
 ] 

Marco Gaido commented on SPARK-24904:
-

I see now what you mean, but yes, I think there is an assumption you are making 
which is not always true, ie. "The output is (expected to be) very small 
compared to the big table". That is not always the case: if all the rows from 
the big table match the small one, it doesn't hold. We may try to do something 
like what you mentioned in the optimizer if CBO is enabled and we have good 
enough statistics about the output size of the inner join, but I am not sure.

> Join with broadcasted dataframe causes shuffle of redundant data
> 
>
> Key: SPARK-24904
> URL: https://issues.apache.org/jira/browse/SPARK-24904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.2
>Reporter: Shay Elbaz
>Priority: Minor
>
> When joining a "large" dataframe with broadcasted small one, and join-type is 
> on the small DF side (see right-join below), the physical plan falls back to 
> sort merge join. But when the join is on the large DF side, the broadcast 
> does take place. Is there a good reason for this? In the below example it 
> sure doesn't make any sense to shuffle the entire large table:
>  
> {code:java}
> val small = spark.range(1, 10)
> val big = spark.range(1, 1 << 30)
>   .withColumnRenamed("id", "id2")
> big.join(broadcast(small), $"id" === $"id2", "right")
> .explain
> //OUTPUT:
> == Physical Plan == 
> SortMergeJoin [id2#16307L], [id#16310L], RightOuter 
> :- *Sort [id2#16307L ASC NULLS FIRST], false, 0
>  :  +- Exchange hashpartitioning(id2#16307L, 1000)
>  : +- *Project [id#16304L AS id2#16307L]
>  :    +- *Range (1, 1073741824, step=1, splits=Some(600))
>  +- *Sort [id#16310L ASC NULLS FIRST], false, 0
>     +- Exchange hashpartitioning(id#16310L, 1000)
>    +- *Range (1, 10, step=1, splits=Some(600))
> {code}
> As a workaround, users need to perform inner instead of right join, and then 
> join the result back with the small DF to fill the missing rows.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data

2018-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555477#comment-16555477
 ] 

Marco Gaido commented on SPARK-24904:
-

You cannot do a broadcast join when the outer side is the small table, as the 
join requires comparing each row of the small table with the whole big table 
and emitting it in the result if no match is found. Since the big table is 
available only in small pieces in each task, no task can determine whether the 
row matched at least once (as it doesn't know what the other tasks did).
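
For reference, a rough sketch of the workaround mentioned in the description
(same names as the example there; assumes a spark-shell session so that
{{spark.implicits._}} is in scope):

{code:scala}
import org.apache.spark.sql.functions.broadcast

// Broadcast hash join on the inner part first...
val inner = big.join(broadcast(small), $"id" === $"id2")
// ...then a left join from the small side fills the rows that had no match with
// nulls, giving the same rows as the right outer join without shuffling `big`.
val rightOuterEquivalent = small.join(inner, Seq("id"), "left")
{code}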

> Join with broadcasted dataframe causes shuffle of redundant data
> 
>
> Key: SPARK-24904
> URL: https://issues.apache.org/jira/browse/SPARK-24904
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.1.2
>Reporter: Shay Elbaz
>Priority: Minor
>
> When joining a "large" dataframe with broadcasted small one, and join-type is 
> on the small DF side (see right-join below), the physical plan does not 
> include broadcasting the small table. But when the join is on the large DF 
> side, the broadcast does take place. Is there a good reason for this? In the 
> below example it sure doesn't make any sense to shuffle the entire large 
> table:
>  
> {code:java}
> val small = spark.range(1, 10)
> val big = spark.range(1, 1 << 30)
>   .withColumnRenamed("id", "id2")
> big.join(broadcast(small), $"id" === $"id2", "right")
> .explain
> //OUTPUT:
> == Physical Plan == 
> SortMergeJoin [id2#16307L], [id#16310L], RightOuter 
> :- *Sort [id2#16307L ASC NULLS FIRST], false, 0
>  :  +- Exchange hashpartitioning(id2#16307L, 1000)
>  : +- *Project [id#16304L AS id2#16307L]
>  :    +- *Range (1, 1073741824, step=1, splits=Some(600))
>  +- *Sort [id#16310L ASC NULLS FIRST], false, 0
>     +- Exchange hashpartitioning(id#16310L, 1000)
>    +- *Range (1, 10, step=1, splits=Some(600))
> {code}
> As a workaround, users need to perform inner instead of right join, and then 
> join the result back with the small DF to fill the missing rows.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543336#comment-16543336
 ] 

Marco Gaido commented on SPARK-24498:
-

[~maropu] yes, I remember I had some trouble compiling the generated code 
with the JDK compiler too. There is also one case (which, in the branch you 
prepared, you addressed by generating the proper code according to the chosen 
compiler) in which there isn't really a way to make both of them happy. In other 
cases, when there is a form which works fine on both, I think it would be great 
to use it. So I agree with your proposal.

My only concern is that as of now we have no way to check compilation with the 
JDK, so it would probably be hard to enforce that we correct all the problems 
and/or don't introduce new ones. So the risk is that the effort spent on that 
task could end up not being so useful...

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will be still our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24782) Simplify conf access in expressions

2018-07-11 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24782:
---

 Summary: Simplify conf access in expressions
 Key: SPARK-24782
 URL: https://issues.apache.org/jira/browse/SPARK-24782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Marco Gaido


Previously, we were not able to access configs on the executor side. This led to 
some workarounds for getting the right configuration on the driver and sending 
it to the executors when dealing with SQL expressions. As these workarounds 
are not needed anymore, we can remove them and simplify the way SQLConf is 
accessed by the expressions.
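
A minimal sketch of the kind of simplification meant here (the helper name is
made up; {{SQLConf.get}} resolves the active conf wherever the code runs):

{code:scala}
import org.apache.spark.sql.internal.SQLConf

// Hypothetical expression-side helper: read the needed setting lazily from the
// active SQLConf instead of capturing it on the driver and shipping it around.
def currentSessionTimeZone(): String = SQLConf.get.sessionLocalTimeZone
{code}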



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24268) DataType in error messages are not coherent

2018-07-10 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24268:

Description: 
In SPARK-22893 there was a tentative to unify the way dataTypes are reported in 
error messages. There, we decided to use always {{dataType.simpleString}}. 
Unfortunately, we missed many places where this still needed to be fixed. 
Moreover, it turns out that the right method to use is not {{simpleString}}, 
but we should use {{catalogString}} instead (for further details please check 
the discussion in the PR https://github.com/apache/spark/pull/21321).

So we should update all the missing places in order to provide error messages 
coherently throughout the project.

  was:
In SPARK-22893 there was a tentative to unify the way dataTypes are reported in 
error messages. There, we decided to use always {{dataType.simpleString}}. 
Unfortunately, we missed many places where this still needed to be fixed. 
Moreover, it turns out that the right method to use is not {{simpleString}}, 
but we should use {{catalogString}} instead (for further details please check 
the discussion in the PR ).

So we should update all the missing places in order to provide error messages 
coherently throughout the project.


> DataType in error messages are not coherent
> ---
>
> Key: SPARK-24268
> URL: https://issues.apache.org/jira/browse/SPARK-24268
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>
> In SPARK-22893 there was a tentative to unify the way dataTypes are reported 
> in error messages. There, we decided to use always {{dataType.simpleString}}. 
> Unfortunately, we missed many places where this still needed to be fixed. 
> Moreover, it turns out that the right method to use is not {{simpleString}}, 
> but we should use {{catalogString}} instead (for further details please check 
> the discussion in the PR https://github.com/apache/spark/pull/21321).
> So we should update all the missing places in order to provide error messages 
> coherently throughout the project.
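
(For illustration, the {{simpleString}} vs {{catalogString}} difference
mentioned above, in spark-shell:)

{code:scala}
import org.apache.spark.sql.types._

// simpleString truncates wide structs (it is meant for compact debug output),
// while catalogString always prints the full type, which is what we want in
// user-facing error messages.
val wide = StructType((1 to 30).map(i => StructField(s"c$i", IntegerType)))
wide.simpleString   // truncated, e.g. "struct<c1:int,c2:int,... N more fields>"
wide.catalogString  // full struct<...> listing all 30 fields
{code}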



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24268) DataType in error messages are not coherent

2018-07-10 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24268:

Description: 
In SPARK-22893 there was a tentative to unify the way dataTypes are reported in 
error messages. There, we decided to use always {{dataType.simpleString}}. 
Unfortunately, we missed many places where this still needed to be fixed. 
Moreover, it turns out that the right method to use is not {{simpleString}}, 
but we should use {{catalogString}} instead (for further details please check 
the discussion in the PR ).

So we should update all the missing places in order to provide error messages 
coherently throughout the project.

  was:
In SPARK-22893 there was a tentative to unify the way dataTypes are reported in 
error messages. There, we decided to use always {{dataType.simpleString}}. 
Unfortunately, we missed many places where this still needed to be fixed. 
Moreover, it turns out that the right method to use is not {{simpleString}}, 
but we should use {{catalogString}} instead.

So we should update all the missing places in order to provide error messages 
coherently throughout the project.


> DataType in error messages are not coherent
> ---
>
> Key: SPARK-24268
> URL: https://issues.apache.org/jira/browse/SPARK-24268
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>
> In SPARK-22893 there was a tentative to unify the way dataTypes are reported 
> in error messages. There, we decided to use always {{dataType.simpleString}}. 
> Unfortunately, we missed many places where this still needed to be fixed. 
> Moreover, it turns out that the right method to use is not {{simpleString}}, 
> but we should use {{catalogString}} instead (for further details please check 
> the discussion in the PR ).
> So we should update all the missing places in order to provide error messages 
> coherently throughout the project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24268) DataType in error messages are not coherent

2018-07-10 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24268:

Description: 
In SPARK-22893 there was a tentative to unify the way dataTypes are reported in 
error messages. There, we decided to use always {{dataType.simpleString}}. 
Unfortunately, we missed many places where this still needed to be fixed. 
Moreover, it turns out that the right method to use is not {{simpleString}}, 
but we should use {{catalogString}} instead.

So we should update all the missing places in order to provide error messages 
coherently throughout the project.

  was:
In SPARK-22893 there was a tentative to unify the way dataTypes are reported in 
error messages. There, we decided to use always {{dataType.simpleString}}. 
Unfortunately, we missed many places where this still needed to be fixed.

So we should update all the missing places in order to provide error messages 
coherently throughout the project.


> DataType in error messages are not coherent
> ---
>
> Key: SPARK-24268
> URL: https://issues.apache.org/jira/browse/SPARK-24268
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>
> In SPARK-22893 there was a tentative to unify the way dataTypes are reported 
> in error messages. There, we decided to use always {{dataType.simpleString}}. 
> Unfortunately, we missed many places where this still needed to be fixed. 
> Moreover, it turns out that the right method to use is not {{simpleString}}, 
> but we should use {{catalogString}} instead.
> So we should update all the missing places in order to provide error messages 
> coherently throughout the project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24745) Map function does not keep rdd name

2018-07-10 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538260#comment-16538260
 ] 

Marco Gaido commented on SPARK-24745:
-

An RDD already has a unique ID. I think the name is just useful for the 
UI/debugging, but if you want to use it in your application you can still set 
the name on the RDD you create by mapping the original RDD, or you can create 
your own RDD implementation which retrieves the name from an ancestor.
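
A minimal sketch of the first suggestion (spark-shell):

{code:scala}
// map() returns a brand new RDD, which starts with no name; re-set (or
// propagate) it explicitly if the application relies on it.
val namedRdd  = sc.makeRDD(List("abc", "123")).setName("named_rdd")
val mappedRdd = namedRdd.map(_.length).setName(namedRdd.name)
println(mappedRdd.name)  // named_rdd
{code}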

> Map function does not keep rdd name 
> 
>
> Key: SPARK-24745
> URL: https://issues.apache.org/jira/browse/SPARK-24745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Igor Pergenitsa
>Priority: Minor
>
> This snippet
> {code:scala}
> val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd")
> println(namedRdd.name)
> val mappedRdd = namedRdd.map(_.length)
> println(mappedRdd.name){code}
> outputs:
> named_rdd
> null



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24745) Map function does not keep rdd name

2018-07-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537199#comment-16537199
 ] 

Marco Gaido commented on SPARK-24745:
-

This makes sense, as the map operation creates a new RDD. So the new RDD has no 
name.

> Map function does not keep rdd name 
> 
>
> Key: SPARK-24745
> URL: https://issues.apache.org/jira/browse/SPARK-24745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Igor Pergenitsa
>Priority: Minor
>
> This snippet
> {code:scala}
> val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd")
> println(namedRdd.name)
> val mappedRdd = namedRdd.map(_.length)
> println(mappedRdd.name){code}
> outputs:
> named_rdd
> null



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels

2018-07-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536747#comment-16536747
 ] 

Marco Gaido commented on SPARK-24719:
-

[~mengxr] any luck with this? Thanks.

> ClusteringEvaluator supports integer type labels
> 
>
> Key: SPARK-24719
> URL: https://issues.apache.org/jira/browse/SPARK-24719
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> ClusterEvaluator should support integer labels because we output integer 
> labels in BisectingKMeans. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77].
>  We should cast numeric types to double in ClusteringEvaluator.
> [~mgaido] Do you have time to work on the fix?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536682#comment-16536682
 ] 

Marco Gaido commented on SPARK-24438:
-

IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value 
in partitions.
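
If this bites someone before a proper fix, a possible user-side workaround
(sketch only; {{df}} is the DataFrame from the repro below and the
{{"__EMPTY__"}} sentinel is just an example value):

{code:scala}
import org.apache.spark.sql.functions.{col, length, when}

// Map empty strings to a distinct, non-empty sentinel before partitioning so
// they do not collapse into the same default/null partition as real nulls.
val safe = df.withColumn("b",
  when(length(col("b")) === 0, "__EMPTY__").otherwise(col("b")))
safe.write.mode("overwrite").partitionBy("b").save("/tmp/partitioned_out")
{code}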

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536682#comment-16536682
 ] 

Marco Gaido edited comment on SPARK-24438 at 7/9/18 8:37 AM:
-

IIRC, Hive has a placeholder string (\_\_HIVE_DEFAULT_PARTITION\_\_) for null 
value in partitions.


was (Author: mgaido):
IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value 
in partitions.

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (YARN-8385) Clean local directories when a container is killed

2018-07-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536677#comment-16536677
 ] 

Marco Gaido commented on YARN-8385:
---

Thanks for your answer [~jlowe]. As stated in the question on SO 
(https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache)
 I think the application directory is used. I see from your comment above why 
the data is not removed by YARN, though. So I think we have to investigate why 
Spark is using the application directory in this case. Thanks.

> Clean local directories when a container is killed
> --
>
> Key: YARN-8385
> URL: https://issues.apache.org/jira/browse/YARN-8385
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Marco Gaido
>Priority: Major
>
> In long running applications, it may happen that many containers are created 
> and killed. A use case is Spark Thrift Server when dynamic allocation is 
> enabled. A lot of containers are killed and the application keeps running 
> indefinitely.
> Currently, YARN seems to remove the local directories only when the whole 
> application terminates. In the scenario described above, this can cause 
> serious resource leakages. Please, check 
> https://issues.apache.org/jira/browse/SPARK-22575.
> I think YARN should clean up all the local directories of a container when it 
> is killed and not when the whole application terminates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (KNOX-1362) Add documentation for the interaction with Spark History Server (SHS)

2018-07-03 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531989#comment-16531989
 ] 

Marco Gaido commented on KNOX-1362:
---

Thanks for your work [~smore]. Sure, no worries. Thank you.

> Add documentation for the interaction with Spark History Server (SHS)
> -
>
> Key: KNOX-1362
> URL: https://issues.apache.org/jira/browse/KNOX-1362
> Project: Apache Knox
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 1.1.0
>
> Attachments: KNOX-1362.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels

2018-07-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530493#comment-16530493
 ] 

Marco Gaido commented on SPARK-24719:
-

[~mengxr] I tried passing integer values in the prediction column and I was not 
able to reproduce any issue (I tried both distance measures). I also checked 
the code and the prediction column is cast to double where needed. Can you 
provide a repro if you faced an issue? If that is not the case, is this JIRA 
meant for a small refactor which makes the casting clearer? Thanks.
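
The check I ran was along these lines (spark-shell, so {{spark.implicits._}} is
in scope):

{code:scala}
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors

// Integer-typed prediction column: the evaluator casts it to double where needed.
val data = Seq(
  (Vectors.dense(0.0, 0.0), 0), (Vectors.dense(0.1, 0.1), 0),
  (Vectors.dense(9.0, 9.0), 1), (Vectors.dense(9.1, 9.1), 1)
).toDF("features", "prediction")

new ClusteringEvaluator().evaluate(data)  // returns the silhouette without errors
{code}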

> ClusteringEvaluator supports integer type labels
> 
>
> Key: SPARK-24719
> URL: https://issues.apache.org/jira/browse/SPARK-24719
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> ClusterEvaluator should support integer labels because we output integer 
> labels in BisectingKMeans. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77].
>  We should cast numeric types to double in ClusteringEvaluator.
> [~mgaido] Do you have time to work on the fix?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels

2018-07-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530264#comment-16530264
 ] 

Marco Gaido commented on SPARK-24719:
-

Sure, thanks. I'll submit a PR ASAP.

> ClusteringEvaluator supports integer type labels
> 
>
> Key: SPARK-24719
> URL: https://issues.apache.org/jira/browse/SPARK-24719
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> ClusterEvaluator should support integer labels because we output integer 
> labels in BisectingKMeans. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77].
>  We should cast numeric types to double in ClusteringEvaluator.
> [~mgaido] Do you have time to work on the fix?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"

2018-07-02 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-24712.
-
Resolution: Not A Problem

> TrainValidationSplit ignores label column name and forces to be "label"
> ---
>
> Key: SPARK-24712
> URL: https://issues.apache.org/jira/browse/SPARK-24712
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Pablo J. Villacorta
>Priority: Major
>
> When a TrainValidationSplit is fit on a Pipeline containing a ML model, the 
> labelCol property of the model is ignored, and the call to fit() will fail 
> unless the labelCol equals "label". As an example, the following pyspark code 
> only works when the variable labelColumn is set to "label"
> {code:java}
> from pyspark.sql.functions import rand, randn
> from pyspark.ml.regression import LinearRegression
> labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS
> df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), 
> randn(seed=27).alias(labelColumn))
> vectorAssembler = 
> VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
> lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
> mypipeline = Pipeline(stages = [vectorAssembler, lr])
> paramGrid = ParamGridBuilder()\
> .addGrid(lr.regParam, [0.01, 0.1])\
> .build()
> trainValidationSplit = TrainValidationSplit()\
> .setEstimator(mypipeline)\
> .setEvaluator(RegressionEvaluator())\
> .setEstimatorParamMaps(paramGrid)\
> .setTrainRatio(0.8)
> trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"

2018-07-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529746#comment-16529746
 ] 

Marco Gaido commented on SPARK-24712:
-

The problem is that you have not set the label column on the evaluator you are 
passing to {{TrainValidationSplit}}. Please set it there and it will work. I am 
closing this; feel free to reopen if you still face a problem.

> TrainValidationSplit ignores label column name and forces to be "label"
> ---
>
> Key: SPARK-24712
> URL: https://issues.apache.org/jira/browse/SPARK-24712
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Pablo J. Villacorta
>Priority: Major
>
> When a TrainValidationSplit is fit on a Pipeline containing a ML model, the 
> labelCol property of the model is ignored, and the call to fit() will fail 
> unless the labelCol equals "label". As an example, the following pyspark code 
> only works when the variable labelColumn is set to "label"
> {code:java}
> from pyspark.sql.functions import rand, randn
> from pyspark.ml.regression import LinearRegression
> labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS
> df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), 
> randn(seed=27).alias(labelColumn))
> vectorAssembler = 
> VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
> lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
> mypipeline = Pipeline(stages = [vectorAssembler, lr])
> paramGrid = ParamGridBuilder()\
> .addGrid(lr.regParam, [0.01, 0.1])\
> .build()
> trainValidationSplit = TrainValidationSplit()\
> .setEstimator(mypipeline)\
> .setEvaluator(RegressionEvaluator())\
> .setEstimatorParamMaps(paramGrid)\
> .setTrainRatio(0.8)
> trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24208) Cannot resolve column in self join after applying Pandas UDF

2018-06-27 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525210#comment-16525210
 ] 

Marco Gaido commented on SPARK-24208:
-

I think this may be a duplicate of SPARK-24373. Can you try 2.3.1?

> Cannot resolve column in self join after applying Pandas UDF
> 
>
> Key: SPARK-24208
> URL: https://issues.apache.org/jira/browse/SPARK-24208
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: AWS EMR 5.13.0
> Amazon Hadoop distribution 2.8.3
> Spark 2.3.0
> Pandas 0.22.0
>Reporter: Rafal Ganczarek
>Priority: Minor
>
> I noticed that after applying Pandas UDF function, a self join of resulted 
> DataFrame will fail to resolve columns. The workaround that I found is to 
> recreate DataFrame with its RDD and schema.
> Below you can find a Python code that reproduces the issue.
> {code:java}
> from pyspark import Row
> import pyspark.sql.functions as F
> @F.pandas_udf('key long, col string', F.PandasUDFType.GROUPED_MAP)
> def dummy_pandas_udf(df):
> return df[['key','col']]
> df = spark.createDataFrame([Row(key=1,col='A'), Row(key=1,col='B'), 
> Row(key=2,col='C')])
> # transformation that causes the issue
> df = df.groupBy('key').apply(dummy_pandas_udf)
> # WORKAROUND that fixes the issue
> # df = spark.createDataFrame(df.rdd, df.schema)
> df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == 
> F.col('temp1.key')).show()
> {code}
> If workaround line is commented out, then above code fails with the following 
> error:
> {code:java}
> AnalysisExceptionTraceback (most recent call last)
>  in ()
>  12 # df = spark.createDataFrame(df.rdd, df.schema)
>  13 
> ---> 14 df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == 
> F.col('temp1.key')).show()
> /usr/lib/spark/python/pyspark/sql/dataframe.py in join(self, other, on, how)
> 929 on = self._jseq([])
> 930 assert isinstance(how, basestring), "how should be 
> basestring"
> --> 931 jdf = self._jdf.join(other._jdf, on, how)
> 932 return DataFrame(jdf, self.sql_ctx)
> 933 
> /usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, 
> *args)
>1158 answer = self.gateway_client.send_command(command)
>1159 return_value = get_return_value(
> -> 1160 answer, self.gateway_client, self.target_id, self.name)
>1161 
>1162 for temp_arg in temp_args:
> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u"cannot resolve '`temp0.key`' given input columns: 
> [temp0.key, temp0.col];;\n'Join Inner, ('temp0.key = 'temp1.key)\n:- 
> AnalysisBarrier\n: +- SubqueryAlias temp0\n:+- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n:   +- Project [key#4099L, col#4098, 
> key#4099L]\n:  +- LogicalRDD [col#4098, key#4099L], false\n+- 
> AnalysisBarrier\n  +- SubqueryAlias temp1\n +- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n+- Project [key#4099L, col#4098, 
> key#4099L]\n   +- LogicalRDD [col#4098, key#4099L], false\n"
> {code}
> The same happens, if instead of DataFrame API I use Spark SQL to do a self 
> join:
> {code:java}
> # df is a DataFrame after applying dummy_pandas_udf
> df.createOrReplaceTempView('df')
> spark.sql('''
> SELECT 
> *
> FROM df temp0
> LEFT JOIN df temp1 ON
> temp0.key == temp1.key
> ''').show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24660) SHS is not showing properly errors when downloading logs

2018-06-26 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24660:
---

 Summary: SHS is not showing properly errors when downloading logs
 Key: SPARK-24660
 URL: https://issues.apache.org/jira/browse/SPARK-24660
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Marco Gaido


The History Server is not properly reporting errors which happen when trying to 
download logs. In particular, when downloading logs the user is not authorized 
to see, the user gets a File Not Found error instead of an unauthorized 
response.

Similarly, trying to download logs from a non-existing application returns a 
server error, instead of a 404 message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (KNOX-1362) Add documentation for the interaction with Spark History Server (SHS)

2018-06-22 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated KNOX-1362:
--
Attachment: KNOX-1362.patch

> Add documentation for the interaction with Spark History Server (SHS)
> -
>
> Key: KNOX-1362
> URL: https://issues.apache.org/jira/browse/KNOX-1362
> Project: Apache Knox
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Marco Gaido
>Priority: Major
> Attachments: KNOX-1362.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KNOX-1362) Add documentation for the interaction with Spark History Server (SHS)

2018-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520314#comment-16520314
 ] 

Marco Gaido commented on KNOX-1362:
---

Thank you [~smore]!

> Add documentation for the interaction with Spark History Server (SHS)
> -
>
> Key: KNOX-1362
> URL: https://issues.apache.org/jira/browse/KNOX-1362
> Project: Apache Knox
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Marco Gaido
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520125#comment-16520125
 ] 

Marco Gaido commented on SPARK-24498:
-

Thanks for your great analysis [~maropu]! Very interesting. Seems like there is 
no advantage in introducing a new compiler.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases the JDK compiler can generate smaller bytecode and compile faster 
> than Janino; in other cases Janino does better. We should support both for our 
> runtime codegen, with Janino remaining the default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696
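
As an editorial illustration (not from the ticket) of what using the JDK compiler for 
runtime codegen means in practice, here is a minimal Scala sketch that compiles a 
generated Java class at runtime with javax.tools and loads it. The class and method 
names are invented for the example.

{code:scala}
import java.net.URLClassLoader
import java.nio.file.Files
import javax.tools.ToolProvider

object JdkCompileSketch {
  def main(args: Array[String]): Unit = {
    // Write a tiny "generated" Java class to a temp directory.
    val dir = Files.createTempDirectory("codegen").toFile
    val src = new java.io.File(dir, "Generated.java")
    Files.write(src.toPath,
      """public class Generated {
        |  public static int addOne(int x) { return x + 1; }
        |}""".stripMargin.getBytes("UTF-8"))

    // getSystemJavaCompiler returns null on a plain JRE, so a JDK is required here.
    val compiler = ToolProvider.getSystemJavaCompiler
    require(compiler != null, "no system Java compiler available")
    require(compiler.run(null, null, null, src.getAbsolutePath) == 0, "compilation failed")

    // Load the freshly compiled class and call the generated method.
    val loader = new URLClassLoader(Array(dir.toURI.toURL))
    val addOne = loader.loadClass("Generated").getMethod("addOne", classOf[Int])
    println(addOne.invoke(null, Integer.valueOf(41)))  // prints 42
  }
}
{code}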



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (KNOX-1362) Add documentation for the interaction with SHS

2018-06-21 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/KNOX-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519036#comment-16519036
 ] 

Marco Gaido commented on KNOX-1362:
---

[~lmccay] I created the issue as you suggested in KNOX-1354. Unfortunately, 
though, I cannot find where the documentation lives in order to provide a patch 
for it. Could you help me with this? Thanks.

> Add documentation for the interaction with SHS
> --
>
> Key: KNOX-1362
> URL: https://issues.apache.org/jira/browse/KNOX-1362
> Project: Apache Knox
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Marco Gaido
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KNOX-1362) Add documentation for the interaction with SHS

2018-06-21 Thread Marco Gaido (JIRA)
Marco Gaido created KNOX-1362:
-

 Summary: Add documentation for the interaction with SHS
 Key: KNOX-1362
 URL: https://issues.apache.org/jira/browse/KNOX-1362
 Project: Apache Knox
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Marco Gaido






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KNOX-1315) Spark UI urls issue: Jobs, stdout/stderr and threadDump links

2018-06-21 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/KNOX-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519032#comment-16519032
 ] 

Marco Gaido commented on KNOX-1315:
---

[~lmccay] this is actually a patch on the YARN UI. As I am not an expert in that 
area, could someone else take a look at it? Thanks.

> Spark UI urls issue: Jobs, stdout/stderr and threadDump links
> -
>
> Key: KNOX-1315
> URL: https://issues.apache.org/jira/browse/KNOX-1315
> Project: Apache Knox
>  Issue Type: Bug
>Affects Versions: 0.14.0, 1.0.0
>Reporter: Guang Yang
>Assignee: Guang Yang
>Priority: Major
> Fix For: 1.1.0
>
> Attachments: KNOX-1315.patch
>
>
> When users get to the Spark UI by clicking on the *{{Application Master}}* link 
> on the YARN application page for running applications, the link for an 
> *individual job* doesn't work. Also, if users go to the *executors* page, the 
> stdout/stderr and threadDump links don't work either.
> The above issues occur on this page: 
> https://host:port/gateway/sandbox/yarn/proxy/application_1525479109400_910288



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24607) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518196#comment-16518196
 ] 

Marco Gaido commented on SPARK-24607:
-

[~viirya] please check the description in the Hive ticket. This happens when 
there are task failures. I have not tried to reproduce it and check whether 
Spark is affected too, but it may be.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: zenglinxi
>Priority: Major
>
> Noticed that the following queries can give different results:
> {code:sql}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build a cube with 
> HiveQL that includes distribute by rand(); I think it is also a hidden, 
> serious problem in Spark SQL.
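
An editorial sketch (not from the report) of the usual mitigation: derive the shuffle 
key from stable columns instead of rand(), so that a retried task reproduces the same 
partitioning. The table name `tbl` and column `id` are assumptions.

{code:scala}
// Paste into spark-shell, where `spark` is predefined.
import org.apache.spark.sql.functions.col

// Risky: rand() is re-evaluated if a task is retried, so rows can land in different
// partitions than in the first attempt, which is what makes the counts diverge.
val risky = spark.sql("SELECT * FROM tbl DISTRIBUTE BY rand()")

// Deterministic alternative: partition by an existing, stable column.
val deterministic = spark.table("tbl").repartition(200, col("id"))
{code}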



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24606) Decimals multiplication and division may be null due to the result precision overflow

2018-06-20 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24606:

Priority: Major  (was: Blocker)

> Decimals multiplication and division may be null due to the result precision 
> overflow
> -
>
> Key: SPARK-24606
> URL: https://issues.apache.org/jira/browse/SPARK-24606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Yan Jian
>Priority: Major
>
> Spark performs multiplication and division on decimals via Java's BigDecimal, 
> whose scale may be greater than its precision, while Spark SQL caps precision 
> at 38. 
> If the result BigDecimal has precision 38 and a scale greater than 38 (e.g. 39), 
> the converted Spark SQL decimal ends up with precision 40 (= 39 + 1, which is 
> greater than 38) and the value becomes null.
>  
> Run the following SQL statements to reproduce this:
> {code:sql}
> select (cast (1.0 as decimal(38,37))) * 1.8;
> select (cast (0.07654387654321 as decimal(38,37))) / 
> 99;
> {code}
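
Editorial note (not part of the report): for a multiplication the result type is 
computed as precision = p1 + p2 + 1 and scale = s1 + s2, so DECIMAL(38,37) * 1.8 
(a DECIMAL(2,1) literal) nominally needs DECIMAL(41,38); once capped at 38 digits 
there is no room left for the integral part, hence the null on Spark 2.2. A 
spark-shell sketch, assuming a 2.3+ build for the 
spark.sql.decimalOperations.allowPrecisionLoss flag:

{code:scala}
// On Spark 2.2 this returns NULL; on 2.3+ the default
// spark.sql.decimalOperations.allowPrecisionLoss=true trades scale for precision
// and returns a rounded value instead.
spark.sql("SELECT CAST(1.0 AS DECIMAL(38,37)) * 1.8").show(false)

// Disabling precision loss (2.3+) restores the NULL behaviour described above.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("SELECT CAST(1.0 AS DECIMAL(38,37)) * 1.8").show(false)
{code}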



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24606) Decimals multiplication and division may be null due to the result precision overflow

2018-06-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518130#comment-16518130
 ] 

Marco Gaido commented on SPARK-24606:
-

Critical and Blocker are reserved for committers. Closing as this is a 
duplicate. Thanks.

> Decimals multiplication and division may be null due to the result precision 
> overflow
> -
>
> Key: SPARK-24606
> URL: https://issues.apache.org/jira/browse/SPARK-24606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Yan Jian
>Priority: Major
>
> Spark performs multiplication and division on decimals via Java's BigDecimal, 
> whose scale may be greater than its precision, while Spark SQL caps precision 
> at 38. 
> If the result BigDecimal has precision 38 and a scale greater than 38 (e.g. 39), 
> the converted Spark SQL decimal ends up with precision 40 (= 39 + 1, which is 
> greater than 38) and the value becomes null.
>  
> Run the following SQL statements to reproduce this:
> {code:sql}
> select (cast (1.0 as decimal(38,37))) * 1.8;
> select (cast (0.07654387654321 as decimal(38,37))) / 
> 99;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24606) Decimals multiplication and division may be null due to the result precision overflow

2018-06-20 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-24606.
-
Resolution: Duplicate

> Decimals multiplication and division may be null due to the result precision 
> overflow
> -
>
> Key: SPARK-24606
> URL: https://issues.apache.org/jira/browse/SPARK-24606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Yan Jian
>Priority: Blocker
>
> Spark performs multiplication and division on decimals via Java's BigDecimal, 
> whose scale may be greater than its precision, while Spark SQL caps precision 
> at 38. 
> If the result BigDecimal has precision 38 and a scale greater than 38 (e.g. 39), 
> the converted Spark SQL decimal ends up with precision 40 (= 39 + 1, which is 
> greater than 38) and the value becomes null.
>  
> Run the following SQL statements to reproduce this:
> {code:sql}
> select (cast (1.0 as decimal(38,37))) * 1.8;
> select (cast (0.07654387654321 as decimal(38,37))) / 
> 99;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-06-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514687#comment-16514687
 ] 

Marco Gaido commented on SPARK-23901:
-

These functions can be used like any other function in Hive; they are not just 
there for the Hive authorizer. I think the use case for them is to anonymize 
data for privacy reasons (e.g. exposing or exporting data to other parties 
without revealing sensitive values, while still being able to use them in joins).
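
An editorial sketch of that use case, using a standard Spark SQL function rather than 
the mask* functions themselves: hash the sensitive column so it can still serve as a 
join key. Table and column names are invented, and the snippet assumes spark-shell, 
where `spark` is predefined.

{code:scala}
import org.apache.spark.sql.functions.{col, sha2}

// Hypothetical tables/columns: anonymize the email address but keep it joinable.
val customers = spark.table("customers")
  .withColumn("email_key", sha2(col("email"), 256))
  .drop("email")
val orders = spark.table("orders")
  .withColumn("email_key", sha2(col("email"), 256))
  .drop("email")

// The join still works because equal emails hash to equal keys.
val joined = customers.join(orders, "email_key")
{code}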

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (KNOX-1358) Create new version definition for SHS

2018-06-15 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/KNOX-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated KNOX-1358:
--
Attachment: KNOX-1358.patch

> Create new version definition for SHS
> -
>
> Key: KNOX-1358
> URL: https://issues.apache.org/jira/browse/KNOX-1358
> Project: Apache Knox
>  Issue Type: New Feature
>Reporter: Marco Gaido
>Priority: Major
> Attachments: KNOX-1358.patch
>
>
> Now that SHS supports X-Forwarded-Context and has fixed several UI issues when 
> running behind a proxy, we can provide a service definition for newer SHS 
> versions that exploits those features, instead of keeping on patching the old 
> service definition, which has been there since Spark 1.4.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KNOX-1358) Create new version definition for SHS

2018-06-15 Thread Marco Gaido (JIRA)
Marco Gaido created KNOX-1358:
-

 Summary: Create new version definition for SHS
 Key: KNOX-1358
 URL: https://issues.apache.org/jira/browse/KNOX-1358
 Project: Apache Knox
  Issue Type: New Feature
Reporter: Marco Gaido


Now that SHS supports X-Forwarded-Context and has fixed several UI issues when 
running behind a proxy, we can provide a service definition for newer SHS 
versions that exploits those features, instead of keeping on patching the old 
service definition, which has been there since Spark 1.4.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KNOX-1353) SHS always showing link to incomplete applications

2018-06-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/KNOX-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513503#comment-16513503
 ] 

Marco Gaido commented on KNOX-1353:
---

Sorry [~lmccay], I'll be more careful next time. Thanks.

> SHS always showing link to incomplete applications
> --
>
> Key: KNOX-1353
> URL: https://issues.apache.org/jira/browse/KNOX-1353
> Project: Apache Knox
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 1.1.0
>
> Attachments: KNOX-1353.patch
>
>
> SHS always shows the "Show incomplete applications" link, even when it is 
> already showing the incomplete applications; in that case it should show the 
> "Back to completed applications" link instead.
> The reason for this behavior is that the URL is not rewritten correctly, so 
> the {{?showIncomplete=true}} parameter gets lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SPARK-24562) Allow running same tests with multiple configs in SQLQueryTestSuite

2018-06-14 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24562:
---

 Summary: Allow running same tests with multiple configs in 
SQLQueryTestSuite
 Key: SPARK-24562
 URL: https://issues.apache.org/jira/browse/SPARK-24562
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Marco Gaido


We often need to run the same queries with different configs in order to check 
their behavior under any condition. In particular, we have 2 cases:
 - the same queries with different configs should give the same result;
 - the same queries with different configs should give different results.

This ticket aims to introduce support for both cases.
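
Not the SQLQueryTestSuite mechanism itself, just an editorial sketch of the pattern 
described above: run one query under several config combinations and compare the 
results. The config key is only an example, and `spark` is assumed to be a 
spark-shell session.

{code:scala}
val configs = Seq(
  Map("spark.sql.codegen.wholeStage" -> "true"),
  Map("spark.sql.codegen.wholeStage" -> "false"))

val results = configs.map { conf =>
  conf.foreach { case (k, v) => spark.conf.set(k, v) }
  spark.sql("SELECT count(*) FROM range(10)").collect().toSeq
}

// Case 1: the results are expected to be identical across configs.
assert(results.distinct.size == 1, "same query gave different results across configs")
{code}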



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


