[jira] [Assigned] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22319:


Assignee: (was: Apache Spark)

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.0
>Reporter: Steven Rand
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.






[jira] [Commented] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212161#comment-16212161
 ] 

Apache Spark commented on SPARK-22319:
--

User 'sjrand' has created a pull request for this issue:
https://github.com/apache/spark/pull/19540

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.0
>Reporter: Steven Rand
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.






[jira] [Assigned] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22319:


Assignee: Apache Spark

> SparkSubmit calls getFileStatus before calling loginUserFromKeytab
> --
>
> Key: SPARK-22319
> URL: https://issues.apache.org/jira/browse/SPARK-22319
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.0
>Reporter: Steven Rand
>Assignee: Apache Spark
>
> In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
> {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.
> We do this before we call {{loginUserFromKeytab}}, which is further down in 
> the same method: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.
> The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
> with:
> {code}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> {code}
> A workaround is to {{kinit}} on the host before using spark-submit. However, 
> it's better if this workaround isn't necessary. A simple fix is to call 
> loginUserFromKeytab before attempting to interact with HDFS.
> At least for cluster mode, this would appear to be a regression caused by 
> SPARK-21012.






[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212154#comment-16212154
 ] 

Liang-Chi Hsieh commented on SPARK-22291:
-

Could you send a PR for this?

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>  Labels: patch, postgresql, sql
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I get the error below when I try to save the data to 
> Cassandra.
> However, creating the same table on Cassandra, with user_ids as a list, works 
> fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point reported in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala.
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> 
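
A possible workaround, sketched below with hypothetical table and connection details, is to cast the uuid[] column to text[] inside the JDBC query so that Spark only ever sees string arrays:

{code}
// Hedged workaround sketch: push the uuid[] -> text[] cast down to PostgreSQL so
// the JDBC reader never has to convert java.util.UUID values. Names are illustrative.
val query = "(SELECT id, user_ids::text[] AS user_ids FROM legacy_table) AS t"

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/legacy")  // hypothetical URL
  .option("dbtable", query)
  .option("user", "reader")
  .option("password", "secret")
  .load()

df.printSchema()  // user_ids should now come through as array<string>
{code}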

[jira] [Created] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-19 Thread Steven Rand (JIRA)
Steven Rand created SPARK-22319:
---

 Summary: SparkSubmit calls getFileStatus before calling 
loginUserFromKeytab
 Key: SPARK-22319
 URL: https://issues.apache.org/jira/browse/SPARK-22319
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 2.3.0
Reporter: Steven Rand


In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
{{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.

We do this before we call {{loginUserFromKeytab}}, which is further down in the 
same method: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.

The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
with:

{code}
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
{code}

A workaround is to {{kinit}} on the host before using spark-submit. However, 
it's better if this workaround isn't necessary. A simple fix is to call 
loginUserFromKeytab before attempting to interact with HDFS.

At least for cluster mode, this would appear to be a regression caused by 
SPARK-21012.
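
A minimal sketch of the intended ordering, with illustrative property names and paths rather than the actual SparkSubmit code:

{code}
// Hedged sketch: perform the Kerberos login from the keytab *before* any HDFS
// access, so that glob resolution (which issues getFileStatus RPCs to the
// NameNode) has valid credentials on a secure cluster.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation

object SubmitOrderingSketch {
  def main(args: Array[String]): Unit = {
    val principal = sys.props.get("spark.yarn.principal").orNull
    val keytab = sys.props.get("spark.yarn.keytab").orNull
    val hadoopConf = new Configuration()

    // 1. Log in from the keytab first, if one was provided.
    if (principal != null && keytab != null) {
      UserGroupInformation.loginUserFromKeytab(principal, keytab)
    }

    // 2. Only now touch HDFS; without the login above this is where the
    //    "No valid credentials provided" SaslException appears.
    val pattern = new Path("hdfs:///user/test/*.jar")  // illustrative path
    val fs = pattern.getFileSystem(hadoopConf)
    val matches = Option(fs.globStatus(pattern)).map(_.toSeq).getOrElse(Seq.empty)
    matches.foreach(status => println(status.getPath))
  }
}
{code}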






[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq

2017-10-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212123#comment-16212123
 ] 

Liang-Chi Hsieh commented on SPARK-22296:
-

Seems no problem with 2.2?

{code}
scala> case class TestData(x: Int, s: String, seq: 
scala.collection.mutable.Seq[Int])
defined class TestData

scala> val ds = Seq(TestData(1, "a", scala.collection.mutable.Seq.empty[Int]), 
TestData(2, "b", scala.collection.mutable.Seq(1, 2))).toDS

scala> ds.show
+---+---+------+
|  x|  s|   seq|
+---+---+------+
|  1|  a|    []|
|  2|  b|[1, 2]|
+---+---+------+

scala> ds.printSchema
root
 |-- x: integer (nullable = false)
 |-- s: string (nullable = true)
 |-- seq: array (nullable = true)
 ||-- element: integer (containsNull = false)

scala> ds.select("seq")
res4: org.apache.spark.sql.DataFrame = [seq: array<int>]

scala> ds.select("seq").show
+------+
|   seq|
+------+
|    []|
|[1, 2]|
+------+
{code}
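
If it does reproduce on 2.1, a possible workaround, sketched here in spark-shell style with hypothetical field names (not the 85-argument class from the report), is to declare such fields as plain scala.collection.Seq, which is the type the generated constructor call passes:

{code}
// Hedged workaround sketch: use scala.collection.Seq for case-class fields that go
// through Dataset encoders, avoiding the mismatch with scala.collection.mutable.Seq.
case class Charge(amount: Double)
case class Account(id: Long, charges: Seq[Charge] = Seq.empty)

import spark.implicits._
val accounts = Seq(Account(1L, Seq(Charge(10.0))), Account(2L)).toDS()
accounts.show()
{code}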

> CodeGenerator - failed to compile when constructor has 
> scala.collection.mutable.Seq vs. scala.collection.Seq
> 
>
> Key: SPARK-22296
> URL: https://issues.apache.org/jira/browse/SPARK-22296
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Randy Tidd
>
> This is with Scala 2.11.
> We have a case class that has a constructor with 85 args, the last two of 
> which are:
>  var chargesInst : 
> scala.collection.mutable.Seq[ChargeInstitutional] = 
> scala.collection.mutable.Seq.empty[ChargeInstitutional],
>  var chargesProf : 
> scala.collection.mutable.Seq[ChargeProfessional] = 
> scala.collection.mutable.Seq.empty[ChargeProfessional]
> A mutable Seq in the constructor of a case class is probably poor form, but 
> Scala allows it.  When we run this job we get this error:
> build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
> worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 8217, Column 70: No applicable constructor/method found for actual parameters 
> "java.lang.String, java.lang.String, long, java.lang.String, long, long, 
> long, java.lang.String, long, long, double, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, long, long, long, long, 
> scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, java.lang.String, int, double, 
> double, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
> com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
> java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
> scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, scala.collection.Seq"; candidates are: 
> "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
> java.lang.String, long, long, long, java.lang.String, long, long, double, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, long, scala.Option, scala.Option, scala.Option, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, int, double, double, java.lang.String, java.lang.String, 
> java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, long, long, long, long, java.lang.String, 
> com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, 
> scala.collection.Seq, scala.collection.Seq, java.lang.String, long, 
> java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, boolean, scala.collection.mutable.Seq, 
> scala.collection.mutable.Seq)"
> The relevant lines are:
> build   17-Oct-2017 05:30:50/* 093 */   

[jira] [Created] (SPARK-22318) spark stream Kafka hang at JavaStreamingContext.start, no spark job create

2017-10-19 Thread iceriver322 (JIRA)
iceriver322 created SPARK-22318:
---

 Summary: spark stream Kafka hang at JavaStreamingContext.start, no 
spark job create
 Key: SPARK-22318
 URL: https://issues.apache.org/jira/browse/SPARK-22318
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.0
 Environment: OS:Red Hat Enterprise Linux Server release 6.5
JRE:Oracle 1.8.0.144-b01
spark-streaming_2.11:2.1.0
spark-streaming-kafka-0-10_2.11:2.1.0
Reporter: iceriver322


A Spark Streaming Kafka jar was submitted by spark-submit to a standalone Spark 
cluster and ran well for a few days. But recently we found that no new jobs were 
generated for the stream. We tried restarting the job and restarting the 
cluster, but the stream just gets stuck at JavaStreamingContext.start, WAITING 
(on object monitor). Thread dump:

2017-10-19 16:44:23
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode):

"Attach Listener" #82 daemon prio=9 os_prio=0 tid=0x7f76f0002000 nid=0x3f80 
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"SparkUI-JettyScheduler" #81 daemon prio=5 os_prio=0 tid=0x7f76ac002800 
nid=0x3d5c waiting on condition [0x7f7693bfa000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xfa19f940> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None

"shuffle-server-3-4" #35 daemon prio=5 os_prio=0 tid=0x7f76a0041800 
nid=0x3d34 runnable [0x7f76911e5000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xf8ea3be8> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xf8ee3600> (a java.util.Collections$UnmodifiableSet)
- locked <0xf8ea3ae0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:401)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None

"shuffle-server-3-3" #34 daemon prio=5 os_prio=0 tid=0x7f76a0040800 
nid=0x3d33 runnable [0x7f76912e6000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xfc2747c0> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xfc2874c0> (a java.util.Collections$UnmodifiableSet)
- locked <0xfc2746c8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:401)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None

"shuffle-server-3-2" #33 daemon prio=5 os_prio=0 tid=0x7f76a003e800 
nid=0x3d32 runnable [0x7f76913e7000]
   java.lang.Thread.State: RUNNABLE
at 

[jira] [Resolved] (SPARK-22026) data source v2 write path

2017-10-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22026.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> data source v2 write path
> -
>
> Key: SPARK-22026
> URL: https://issues.apache.org/jira/browse/SPARK-22026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Commented] (SPARK-22312) Spark job stuck with no executor due to bug in Executor Allocation Manager

2017-10-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212071#comment-16212071
 ] 

Saisai Shao commented on SPARK-22312:
-

I think it is a duplicate of SPARK-11334, according to your description in the PR.

> Spark job stuck with no executor due to bug in Executor Allocation Manager
> --
>
> Key: SPARK-22312
> URL: https://issues.apache.org/jira/browse/SPARK-22312
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We often see Spark jobs get stuck because the Executor Allocation Manager does 
> not request any executors even though there are pending tasks, when dynamic 
> allocation is turned on. Looking at the logic in the EAM that calculates the 
> number of running tasks, the calculation can go wrong and the running-task 
> count can become negative.
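
As a rough illustration only (a hypothetical counter, not the actual ExecutorAllocationManager code): once the running-task count drifts below zero, a target derived from it can end up requesting no executors at all unless the count is clamped.

{code}
// Hedged illustration: a running-task counter that can go negative if task-end
// events are over-counted, and the kind of guard a fix would need.
var runningTasks = 0
def onTaskStart(): Unit = runningTasks += 1
def onTaskEnd(): Unit = runningTasks -= 1  // delivered twice => count goes negative

def targetExecutors(pendingTasks: Int, tasksPerExecutor: Int): Int = {
  val demand = pendingTasks + math.max(runningTasks, 0)  // clamp so the target never collapses
  math.ceil(demand.toDouble / tasksPerExecutor).toInt
}
{code}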






[jira] [Created] (SPARK-22317) Spark Thrift job with HTTP ERROR 500

2017-10-19 Thread hoangle (JIRA)
hoangle created SPARK-22317:
---

 Summary: Spark Thrift job with HTTP ERROR 500
 Key: SPARK-22317
 URL: https://issues.apache.org/jira/browse/SPARK-22317
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.6.3
 Environment: Hortonwork HDP-2.6
Reporter: hoangle


I am running the Spark 1 Thrift Server as a proxy.

It does not keep running as long as I expect.

Every two days, Spark dies as if on a cron schedule, without any error log.
Sometimes I also cannot access the Spark Thrift Server URL and get HTTP error 
500, even though the YARN RM still shows the Thrift Server as running.

This is the error I get with error code 500.


{code:html}
Error 500 Server Error

HTTP ERROR 500
Problem accessing /jobs/. Reason:
    Server Error
Caused by:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at 
scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:427)
at scala.xml.Node.buildString(Node.scala:161)
at scala.xml.Node.toString(Node.scala:166)
at 
org.apache.spark.ui.JettyUtils$$anonfun$htmlResponderToServlet$1.apply(JettyUtils.scala:55)
at 
org.apache.spark.ui.JettyUtils$$anonfun$htmlResponderToServlet$1.apply(JettyUtils.scala:55)
at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:83)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at 
org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at 
org.spark-project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1507)
at 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:179)
at 
org.spark-project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
at 
org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
at 
org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at 
org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)
at 
org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at 
org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:301)
at 
org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.spark-project.jetty.server.Server.handle(Server.java:370)
at 
org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
at 
org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:641)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
at 
org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at 
org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)

Powered by Jetty://



{code}

I can see the reason is java.lang.OutOfMemoryError: Java heap space, but how 
can I fix it? I do not know which option I need to reconfigure in Spark.
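
Since the OutOfMemoryError is thrown while rendering the /jobs/ page on the Thrift Server driver, one possible starting point (illustrative values only, not a verified fix) is to give the driver more heap and retain less UI history, e.g. in spark-defaults.conf:

{code}
# Illustrative settings only -- tune to your cluster.
spark.driver.memory      4g
spark.ui.retainedJobs    250
spark.ui.retainedStages  250
{code}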







[jira] [Commented] (SPARK-22209) PySpark does not recognize imports from submodules

2017-10-19 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212027#comment-16212027
 ] 

Joel Croteau commented on SPARK-22209:
--

Yes [~bryanc], I know I can do that. It's just annoying that a long workflow 
failed after running for quite a while because of that. At the very least, it 
should be documented.

> PySpark does not recognize imports from submodules
> --
>
> Key: SPARK-22209
> URL: https://issues.apache.org/jira/browse/SPARK-22209
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
> Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK 
> 1.8, Centos 6
>Reporter: Joel Croteau
>Priority: Minor
>
> Using submodule syntax inside a PySpark job seems to create issues. For 
> example, the following:
> {code:python}
> import scipy.sparse
> from pyspark import SparkContext, SparkConf
>
> def do_stuff(x):
>     y = scipy.sparse.dok_matrix((1, 1))
>     y[0, 0] = x
>     return y[0, 0]
>
> def init_context():
>     conf = SparkConf().setAppName("Spark Test")
>     sc = SparkContext(conf=conf)
>     return sc
>
> def main():
>     sc = init_context()
>     data = sc.parallelize([1, 2, 3, 4])
>     output_data = data.map(do_stuff)
>     print(output_data.collect())
>
> __name__ == '__main__' and main()
> {code}
> produces this error:
> {noformat}
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
> at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
> at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
>  line 177, in main
> process()

[jira] [Assigned] (SPARK-22268) Fix java style errors

2017-10-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22268:


Assignee: Andrew Ash

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>Priority: Trivial
> Fix For: 2.3.0
>
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}






[jira] [Resolved] (SPARK-22268) Fix java style errors

2017-10-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22268.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19486
[https://github.com/apache/spark/pull/19486]

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Trivial
> Fix For: 2.3.0
>
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}






[jira] [Commented] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-10-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211923#comment-16211923
 ] 

Joseph K. Bradley commented on SPARK-19357:
---

Linking [SPARK-22126], which tracks improving the API to permit model-specific 
optimizations.

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task, Optimizations for ML Pipeline 
> Tuning, to perform model evaluation in parallel.  A simple approach is to 
> evaluate the models naively, with a parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.
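
A minimal usage sketch of the proposed knob, assuming the setParallelism setter this task adds to CrossValidator:

{code}
// Hedged sketch: evaluate candidate models with up to 4 threads instead of serially.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(4)  // 1 = serial evaluation (current behavior); higher values cache more
{code}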






[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211906#comment-16211906
 ] 

Dongjoon Hyun commented on SPARK-22284:
---

Ping, [~mgaido].

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> In short, the code looks like this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes target versions up to Spark 2.1.0, so I don't know whether 
> running it on Spark 2.2.0 would fix it. In any case, I cannot change the 
> version of Spark since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but I still get the same error.
> The job has worked well up to now, even with large datasets, but apparently 
> this batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?






[jira] [Commented] (SPARK-21785) Support create table from a file schema

2017-10-19 Thread Jacky Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211897#comment-16211897
 ] 

Jacky Shen commented on SPARK-21785:


any comment [~dongjoon]?

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Currently, Spark does not support creating a table from a file's schema, 
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}
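
A possible workaround until such syntax exists, sketched with the paths from the example above and an illustrative table name (the Catalog call assumes Spark 2.2's createTable), is to read the schema from the Parquet file and create the table through the Catalog API:

{code}
// Hedged workaround sketch: derive the schema from an existing Parquet file and
// create an external table over a different location from it.
val schema = spark.read.parquet("/user/test/abc/a.snappy.parquet").schema
spark.catalog.createTable("test", "parquet", schema, Map("path" -> "/user/test/def/"))
{code}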






[jira] [Comment Edited] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211849#comment-16211849
 ] 

Russell Spitzer edited comment on SPARK-22316 at 10/19/17 10:21 PM:


Nope, it's not the parens ... I'm allowed to have columns with parens in this 
example
{code}
scala> val ds = spark.createDataset(1 to 10).toDF("ColumnWith(Parens)")
ds: org.apache.spark.sql.DataFrame = [ColumnWith(Parens): int]
scala> ds.filter(ds(ds.columns(0)) < 5) . show
{code}




was (Author: rspitzer):
Nope I'm allowed to have columns with parens in this example
{code}
scala> val ds = spark.createDataset(1 to 10).toDF("ColumnWith(Parens)")
ds: org.apache.spark.sql.DataFrame = [ColumnWith(Parens): int]
scala> ds.filter(ds(ds.columns(0)) < 5) . show
{code}



> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataset(Seq(Customer(1,Person("russ", 85))))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> should work, for every valid i < ds.columns.size.






[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211849#comment-16211849
 ] 

Russell Spitzer commented on SPARK-22316:
-

Nope I'm allowed to have columns with parens in this example
{code}
scala> val ds = spark.createDataset(1 to 10).toDF("ColumnWith(Parens)")
ds: org.apache.spark.sql.DataFrame = [ColumnWith(Parens): int]
scala> ds.filter(ds(ds.columns(0)) < 5) . show
{code}



> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataset(Seq(Customer(1,Person("russ", 85))))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> should work, for every valid i < ds.columns.size.






[jira] [Updated] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Spitzer updated SPARK-22316:

Description: 
Given a dataset which has been run through reduceGroups like this
{code}
case class Person(name: String, age: Int)
case class Customer(id: Int, person: Person)
val ds = spark.createDataset(Seq(Customer(1,Person("russ", 85))))
val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
{code}

We end up with a Dataset with the schema

{code}
 org.apache.spark.sql.types.StructType = 
StructType(
  StructField(value,IntegerType,false), 
  StructField(ReduceAggregator(Customer),
StructType(StructField(id,IntegerType,false),
StructField(person,
  StructType(StructField(name,StringType,true),
  StructField(age,IntegerType,false))
   ,true))
,true))
{code}

The column names are 
{code}
Array(value, ReduceAggregator(Customer))
{code}

But you cannot select the "ReduceAggregatorColumn"
{code}
grouped.select(grouped.columns(1))
org.apache.spark.sql.AnalysisException: cannot resolve 
'`ReduceAggregator(Customer)`' given input columns: [value, 
ReduceAggregator(Customer)];;
'Project ['ReduceAggregator(Customer)]
+- Aggregate [value#338], [value#338, 
reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
Some(newInstance(class Customer)), Some(class Customer), 
Some(StructType(StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 AS 
value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) null 
else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).id AS id#195, person, if 
(isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, true]._2)).person)) 
null else named_struct(name, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).person).name, true), age, 
assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person)) null else named_struct(name, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person).name, true), age, 
assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person).age) AS person#196, StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true), true, 0, 0) AS 
ReduceAggregator(Customer)#346]
   +- AppendColumns , class Customer, 
[StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
[input[0, int, false] AS value#338]
  +- LocalRelation [id#197, person#198]
{code}

You can work around this by using "toDF" to rename the column

{code}
scala> grouped.toDF("key", "reduced").select("reduced")
res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
{code}

I think that all invocations of 
{code}
ds.select(ds.columns(i))
{code}
should work, for every valid i < ds.columns.size.

  was:
Given a dataset which has been run through reduceGroups like this
{code}
case class Person(name: String, age: Int)
case class Customer(id: Int, person: Person)
val ds = spark.createDataset(Seq(Customer(1,Person("russ", 85))))
val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
{code}

We end up with a Dataset with the schema

{code}
 org.apache.spark.sql.types.StructType = 
StructType(
  StructField(value,IntegerType,false), 
  StructField(ReduceAggregator(Customer),
StructType(StructField(id,IntegerType,false),
StructField(person,
  StructType(StructField(name,StringType,true),
  StructField(age,IntegerType,false))
   ,true))
,true))
{code}

The column names are 
{code}
Array(value, ReduceAggregator(Customer))
{code}

But you cannot select the "ReduceAggregatorColumn"
{code}
grouped.select(grouped.columns(1))
org.apache.spark.sql.AnalysisException: cannot resolve 
'`ReduceAggregator(Customer)`' given input columns: [value, 
ReduceAggregator(Customer)];;
'Project ['ReduceAggregator(Customer)]
+- Aggregate [value#338], [value#338, 
reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
Some(newInstance(class Customer)), Some(class Customer), 
Some(StructType(StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 AS 
value#340, if ((isnull(input[0, scala.Tuple2, 

[jira] [Updated] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Spitzer updated SPARK-22316:

Description: 
Given a dataset which has been run through reduceGroups like this
{code}
case class Person(name: String, age: Int)
case class Customer(id: Int, person: Person)
val ds = spark.createDataset(Seq(Customer(1, Person("russ", 85))))
val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
{code}

We end up with a Dataset with the schema

{code}
 org.apache.spark.sql.types.StructType = 
StructType(
  StructField(value,IntegerType,false), 
  StructField(ReduceAggregator(Customer),
StructType(StructField(id,IntegerType,false),
StructField(person,
  StructType(StructField(name,StringType,true),
  StructField(age,IntegerType,false))
   ,true))
,true))
{code}

The column names are 
{code}
Array(value, ReduceAggregator(Customer))
{code}

But you cannot select the "ReduceAggregatorColumn"
{code}
grouped.select(grouped.columns(1))
org.apache.spark.sql.AnalysisException: cannot resolve 
'`ReduceAggregator(Customer)`' given input columns: [value, 
ReduceAggregator(Customer)];;
'Project ['ReduceAggregator(Customer)]
+- Aggregate [value#338], [value#338, 
reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
Some(newInstance(class Customer)), Some(class Customer), 
Some(StructType(StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 AS 
value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) null 
else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).id AS id#195, person, if 
(isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, true]._2)).person)) 
null else named_struct(name, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).person).name, true), age, 
assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person)) null else named_struct(name, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person).name, true), age, 
assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person).age) AS person#196, StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true), true, 0, 0) AS 
ReduceAggregator(Customer)#346]
   +- AppendColumns , class Customer, 
[StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
[input[0, int, false] AS value#338]
  +- LocalRelation [id#197, person#198]
{code}

You can work around this by using "toDF" to rename the column

{code}
scala> grouped.toDF("key", "reduced").select("reduced")
res55: org.apache.spark.sql.DataFrame = [reduced: struct>]
{code}

I think that every invocation of 
{code}
ds.select(ds.columns(i))
{code}
should work for all valid i < the number of columns.

  was:
Given a dataset which has been run through reduceGroups like this
{code}
case class Person(name: String, age: Int)
case class Customer(id: Int, person: Person)
val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
{code}

We end up with a Dataset with the schema

{code}
 org.apache.spark.sql.types.StructType = 
StructType(
  StructField(value,IntegerType,false), 
  StructField(ReduceAggregator(Customer),
StructType(StructField(id,IntegerType,false),
StructField(person,
  StructType(StructField(name,StringType,true),
  StructField(age,IntegerType,false))
   ,true))
,true))
{code}

The column names are 
{code}
Array(value, ReduceAggregator(Customer))
{code}

But you cannot select the "ReduceAggregatorColumn"
{code}
grouped.select(grouped.columns(1))
org.apache.spark.sql.AnalysisException: cannot resolve 
'`ReduceAggregator(Customer)`' given input columns: [value, 
ReduceAggregator(Customer)];;
'Project ['ReduceAggregator(Customer)]
+- Aggregate [value#338], [value#338, 
reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
Some(newInstance(class Customer)), Some(class Customer), 
Some(StructType(StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 AS 
value#340, if ((isnull(input[0, 

[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211846#comment-16211846
 ] 

Russell Spitzer commented on SPARK-22316:
-

You are right, I meant to have createDataset up there. I'll modify the example.

This shouldn't be Dataset-specific, since the underlying issue doesn't really 
depend on the encoders; it's just that the automatically generated name cannot 
be used to select the column. After all, a DataFrame is just a Dataset[Row] :). 

I haven't really looked into this, but I'm guessing it's the "(" characters in 
the auto-generated column name.

Here is an example with DataFrames for good measure:
{code}
case class Person(name: String, age: Int)
case class Customer(id: Int, person: Person)
val ds = spark.createDataFrame(Seq(Customer(1, Person("russ", 85))))
val grouped = ds.groupByKey(c => c.getInt(0)).reduceGroups( (x,y) => x )
grouped(grouped.columns(1))
/**org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(org.apache.spark.sql.Row)" among (value, 
ReduceAggregator(org.apache.spark.sql.Row));
**/
{code}
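
For what it's worth, the {{toDF}} rename workaround from the issue description 
appears to carry over to this DataFrame variant too (a minimal sketch built on 
the example above, not verified output):
{code}
// Rename the generated columns first; the reduced struct can then be addressed by name.
val renamed = grouped.toDF("key", "reduced")
renamed.select("reduced")                       // project the reduced struct
renamed.select("reduced.*")                     // or expand its fields
renamed.withColumn("newId", renamed("reduced")) // and reference it in new columns
{code}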






> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct struct>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> For all 

[jira] [Commented] (SPARK-22209) PySpark does not recognize imports from submodules

2017-10-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211772#comment-16211772
 ] 

Bryan Cutler commented on SPARK-22209:
--

As a workaround, you could probably do the following

{code}
from scipy import sparse

def do_stuff(x):
    y = sparse.dok_matrix((1, 1))
    y[0, 0] = x
    return y[0, 0]
{code}

> PySpark does not recognize imports from submodules
> --
>
> Key: SPARK-22209
> URL: https://issues.apache.org/jira/browse/SPARK-22209
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
> Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK 
> 1.8, Centos 6
>Reporter: Joel Croteau
>Priority: Minor
>
> Using submodule syntax inside a PySpark job seems to create issues. For 
> example, the following:
> {code:python}
> import scipy.sparse
> from pyspark import SparkContext, SparkConf
> def do_stuff(x):
>     y = scipy.sparse.dok_matrix((1, 1))
>     y[0, 0] = x
>     return y[0, 0]
> def init_context():
>     conf = SparkConf().setAppName("Spark Test")
>     sc = SparkContext(conf=conf)
>     return sc
> def main():
>     sc = init_context()
>     data = sc.parallelize([1, 2, 3, 4])
>     output_data = data.map(do_stuff)
>     print(output_data.collect())
> __name__ == '__main__' and main()
> {code}
> produces this error:
> {noformat}
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
> at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
> at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
>  line 177, in main
> process()

[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-10-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211750#comment-16211750
 ] 

Bryan Cutler commented on SPARK-22250:
--

[~ferdonline] maybe SPARK-20791 would help you out when working with numpy 
arrays?

> Be less restrictive on type checking
> 
>
> Key: SPARK-22250
> URL: https://issues.apache.org/jira/browse/SPARK-22250
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Fernando Pereira
>Priority: Minor
>
> I find types.py._verify_type() often too restrictive. E.g. 
> {code}
> TypeError: FloatType can not accept object 0 in type 
> {code}
> I believe it would be globally acceptable to fill a float field with an int, 
> especially since in some formats (json) you don't have a way of inferring the 
> type correctly.
> Another situation relates to other equivalent numerical types, like 
> array.array or numpy. A numpy scalar int is not accepted as an int, and these 
> arrays have always to be converted down to plain lists, which can be 
> prohibitively large and computationally expensive.
> Any thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211733#comment-16211733
 ] 

Sean Owen commented on SPARK-22316:
---

OK, so it's just generally about selecting this column, not needing to achieve 
a particular result. 

You're really operating on Datasets here (right? the example doesn't compile 
without createDataset) rather than DataFrames. DataFrames are the columnar 
abstraction, though Datasets also notionally have a schema. I mean that I 
wouldn't generally look to manipulate "columns" of a Dataset, but explicitly 
turn it back into a DataFrame. And yes as you say you can control the naming 
that way as you like. I think that's the right way to approach this; I always 
avoid relying on generated column names anyway.

I don't know whether it's intentional that the 'column' in the Dataset is 
un-selectable because it's an implementation detail, or whether it's just a 
quirk. Even after toDF() I can't seem to select it by what its name appears to be.

> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct struct>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> For all valid i < columns size should work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211670#comment-16211670
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

Thank you, [~budde]!

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fix manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Adam Budde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211642#comment-16211642
 ] 

Adam Budde commented on SPARK-22306:


I'll take a look and potentially start on a patch as soon as I get the time. 
Should be the next day or two.

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fix manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211633#comment-16211633
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

Hi, [~budde].
Could you take a look at this issue in Spark 2.2.0?

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fix manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22306:
--
Description: 
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.
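For example, a minimal way to switch the property (assuming a running 
SparkSession named {{spark}}):
{code}
// Stop Spark from inferring and saving a case-sensitive schema back to the metastore.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")

// or pass it at submit time:
//   spark-submit --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER ...
{code}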

Also, note that the damage can be fixed manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch is 
good due to SPARK-17729. This is a regression on Spark 2.2 only. By default, 
Parquet Hive tables are affected, and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("SELECT * FROM t").show(false)

hive> DESC FORMATTED t;
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
{code}

  was:
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to: NEVER_INFER prevents the issue.

Also, note that the damage can be fix manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
default, Parquet Hive table is affected and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("SELECT * FROM t").show(false)

hive> DESC FORMATTED t;
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
{code}


> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fix manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is 

[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-19 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211576#comment-16211576
 ] 

Cosmin Lehene commented on SPARK-21033:
---

[~srowen] I'm hitting this while running some large jobs. I'm planning on 
patching and giving this a run today.

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-19 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211574#comment-16211574
 ] 

Cosmin Lehene commented on SPARK-21033:
---

[~cloud_fan] Can you update the title and description? It helps, when finding 
the issue through Google, to get an accurate description within JIRA.

{quote}In UnsafeInMemorySorter, one record may take 32 bytes: 1 long for 
pointer, 1 long for key-prefix, and another 2 longs as the temporary buffer for 
radix sort.

In UnsafeExternalSorter, we set the DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD to 
be 1024 * 1024 * 1024 / 2, and hoping the max size of point array to be 8 GB. 
However this is wrong, 1024 * 1024 * 1024 / 2 * 32 is actually 16 GB, and if we 
grow the point array before reach this limitation, we may hit the max-page-size 
error.

This PR fixes this by making DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD 2 times 
smaller, and adding a safe check in 
UnsafeExternalSorter.growPointerArrayIfNecessary to avoid allocating a page 
larger than max page size.

{quote}
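
For reference, the arithmetic behind those numbers (a quick sanity check in Scala):
{code}
// 1 long pointer + 1 long key-prefix + 2 longs of radix-sort buffer = 4 * 8 = 32 bytes/record
val bytesPerRecord = 4 * 8L
val spillThreshold = 1024L * 1024 * 1024 / 2          // 536,870,912 records
val pointerArrayBytes = spillThreshold * bytesPerRecord
// 17,179,869,184 bytes = 16 GB, i.e. double the intended 8 GB cap
{code}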

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211561#comment-16211561
 ] 

Sean Owen commented on SPARK-22308:
---

Yeah, I wouldn't document it. I disagree that it should be an API in the 
user-facing sense. A consistent and scaladoc'ed approach internally? Yes.
The refactoring looks promising if it's really just collecting common code into 
base classes. I don't know the code well enough to evaluate it thoroughly, but if 
it's not changing functionality, just improving organization, that's good. If 
it happens to make it easier for someone to reuse, all the better.

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their scalatests - basing on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211562#comment-16211562
 ] 

Sean Owen commented on SPARK-21033:
---

I'm seeing that same problem consistently in a deployment [~clehene], though I 
also don't know whether it's resolved by the change above.


> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Nathan Kronenfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211546#comment-16211546
 ] 

Nathan Kronenfeld commented on SPARK-22308:
---

> Sure, but aren't you saying that the ship has sailed and someone already 
> pulled this into a reusable library?

I think not just one, but many people have pulled it out.

And it's only semi-reusable.  If something were to change in Spark that made the 
pattern undesirable, the change would take forever to propagate (if it ever did), 
because everyone has their own copy.

> I'm unclear if you're saying it already has what you want

Basically, it does have what I want already, if I were using FunSuite-based 
tests.  I'm proposing a pretty simple tweak here to allow it to support the 
other three major ScalaTest patterns.

> But documenting it as a supported API is a step too far I think.

If you would like the documentation changes taken out of the PR, I can do that. 
 That is perhaps a larger issue than this PR needs to be.  But if we do that, I 
would suggest making another issue for it (which I would be happy to do).  
Spark should have a documented testing API, whether this one or another, better 
one.  External testing libraries will always be a version or two behind, and 
testing is too fundamental.
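
To make the shape of this concrete, here is a rough sketch of the kind of 
suite-agnostic mixin being described (illustrative names only, not Spark's 
actual test helpers):
{code}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// A shared-session trait that only assumes org.scalatest.Suite, so it can be
// mixed into FunSuite, FunSpec, FlatSpec or WordSpec based tests alike.
trait SharedSparkSessionForAnySuite extends BeforeAndAfterAll { self: Suite =>
  @transient protected var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder().master("local[2]").appName(suiteName).getOrCreate()
  }

  override def afterAll(): Unit = {
    try {
      if (spark != null) spark.stop()
      spark = null
    } finally {
      super.afterAll()
    }
  }
}
{code}
A test in any of the four styles could then extend its preferred base class and 
mix this trait in.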

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their scalatests - basing on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211528#comment-16211528
 ] 

Russell Spitzer edited comment on SPARK-22316 at 10/19/17 6:49 PM:
---

I can imagine many reasons I might want to access a column in a DataFrame. 

Let's say I want to add a new column based on a property of my reduced group:
{code}
scala> grouped.withColumn("newId", grouped(grouped.columns(1)))
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

Or filter based on the results
{code}
scala> grouped.filter(grouped(grouped.columns(1)) < 5)
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

or use the literal select function to expand the Struct, which gives a slightly 
different error:
{code}
scala> grouped.select("ReduceAggregator(Customer).*")
org.apache.spark.sql.AnalysisException: cannot resolve 
'ReduceAggregator(Customer).*' give input columns 'id, person, value';
{code}

I should add that when I say "select" I mean actually accessing the column by 
its name, for any purpose.





was (Author: rspitzer):
I can imagine many reasons i might want to access a column in a dataframe. 

Lets say I want to add a new column based on the property of my reduced group
{code}
scala> grouped.withColumn("newId", grouped(grouped.columns(1)))
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

Or filter based on the results
{code}
scala> grouped.filter(grouped(grouped.columns(1)) < 5)
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

or to use the literal select function to expand the Struct which gives a 
slightly different 
error
{code}
scala> grouped.select("ReduceAggregator(Customer).*")
org.apache.spark.sql.AnalysisException: cannot resolve 
'ReduceAggregator(Customer).*' give input columns 'id, person, value';
{code}

I should when i say "select" I mean actually access the column by it's name for 
any purpose.




> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> 

[jira] [Comment Edited] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211528#comment-16211528
 ] 

Russell Spitzer edited comment on SPARK-22316 at 10/19/17 6:46 PM:
---

I can imagine many reasons i might want to access a column in a dataframe. 

Lets say I want to add a new column based on the property of my reduced group
{code}
scala> grouped.withColumn("newId", grouped(grouped.columns(1)))
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

Or filter based on the results
{code}
scala> grouped.filter(grouped(grouped.columns(1)) < 5)
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

or to use the literal select function to expand the Struct which gives a 
slightly different 
error
{code}
scala> grouped.select("ReduceAggregator(Customer).*")
org.apache.spark.sql.AnalysisException: cannot resolve 
'ReduceAggregator(Customer).*' give input columns 'id, person, value';
{code}

I should when i say "select" I mean actually access the column by it's name for 
any purpose.





was (Author: rspitzer):
I can imagine many reasons i might want to access a column in a dataframe. 

Lets say I want to add a new column based on the property of my reduced group
{code}
scala> grouped.withColumn("newId", grouped(grouped.columns(1)))
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

Or filter based on the results
{code}
scala> grouped.filter(grouped(grouped.columns(1)) < 5)
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

I should when i say "select" I mean actually access the column by it's name for 
any purpose.




> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 

[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22306:
--
Description: 
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.

Also, note that the damage can be fixed manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
default, Parquet Hive tables are affected, and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("SELECT * FROM t").show(false)

hive> DESC FORMATTED t;
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
{code}

  was:
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to: NEVER_INFER prevents the issue.

Also, note that the damage can be fix manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
default, Parquet Hive table is affected and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("DESC FORMATTED t").show(false)
// no information about buckets.

hive> DESC FORMATTED t;
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
{code}


> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fix manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY 

[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211528#comment-16211528
 ] 

Russell Spitzer commented on SPARK-22316:
-

I can imagine many reasons I might want to access a column in a DataFrame. 

Let's say I want to add a new column based on a property of my reduced group:
{code}
scala> grouped.withColumn("newId", grouped(grouped.columns(1)))
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

Or filter based on the results
{code}
scala> grouped.filter(grouped(grouped.columns(1)) < 5)
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"ReduceAggregator(Customer)" among (value, ReduceAggregator(Customer));
{code}

I should add that when I say "select" I mean actually accessing the column by 
its name, for any purpose.




> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregator(Customer)" column
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> should work for all valid i < columns.size.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-19 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211526#comment-16211526
 ] 

Cosmin Lehene commented on SPARK-21033:
---

I think this may be responsible for other problems, such as being unable to
allocate memory while running in a container, as well as being killed for
exceeding the container's maximum memory.

{noformat}
17/10/19 18:15:39 INFO memory.TaskMemoryManager: Memory used in task 6317340
17/10/19 18:15:39 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@2c98b15b: 32.0 KB
17/10/19 18:15:39 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@566144b7: 
64.0 KB
17/10/19 18:15:39 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@ea479ad: 13.3 
GB
17/10/19 18:15:39 INFO memory.TaskMemoryManager: 0 bytes of memory were used by 
task 6317340 but are not associated with specific consumers
17/10/19 18:15:39 INFO memory.TaskMemoryManager: 14496792576 bytes of memory 
are used for execution and 198127044 bytes of memory are used for storage
17/10/19 18:15:39 ERROR executor.Executor: Exception in task 6.0 in stage 320.2 
(TID 6317340)
java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.(UnsafeInMemorySorter.java:126)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:153)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:120)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:82)
at 
org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown
 Source)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:392)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:389)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211518#comment-16211518
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

When we release 2.2.1, we should warn users about this regression.
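
As a stop-gap, a minimal sketch of the mitigation from the description (switching
the inference mode back to NEVER_INFER), assuming the property can be supplied as a
session config:
{code}
// Mitigation sketch: start the session with schema inference disabled so that a
// plain SELECT does not rewrite the table's metastore metadata.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()

spark.sql("SELECT * FROM t").show()
{code}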

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("DESC FORMATTED t").show(false)
> // no information about buckets.
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22306:
--
Priority: Critical  (was: Major)

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("DESC FORMATTED t").show(false)
> // no information about buckets.
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22306:
--
Description: 
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.

Also, note that the damage can be fixed manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
default, Parquet Hive table is affected and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]

scala> sql("DESC FORMATTED t").show(false)
// no information about buckets.

hive> DESC FORMATTED t;
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
{code}

  was:
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.

Also, note that the damage can be fixed manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
default, Parquet Hive table is affected and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]


{code}


> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   

[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211507#comment-16211507
 ] 

Sean Owen commented on SPARK-22316:
---

Why are you trying to select it -- why is it a bug?

> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregator(Customer)" column
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> should work for all valid i < columns.size.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22306:
--
Description: 
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.

Also, note that the damage can be fixed manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}

*REPRODUCE (branch-2.2)*
Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
default, Parquet Hive table is affected and only Hive may suffer from this.
{code}
hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
INTO 10 BUCKETS STORED AS PARQUET;
hive> INSERT INTO t VALUES('a','b');
hive> DESC FORMATTED t;
...
Num Buckets:10
Bucket Columns: [a, b]
Sort Columns:   [Order(col:a, order:1), Order(col:b, order:1)]


{code}

  was:
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.

Also, note that the damage can be fixed manually in Hive with e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}



> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> Spark 2.3 (master) branch is good. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-19 Thread Russell Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Spitzer updated SPARK-22316:

Summary: Cannot Select ReducedAggregator Column  (was: Cannot Select 
ReducedAggreagtor Column)

> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregator(Customer)" column
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> should work for all valid i < columns.size.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22316) Cannot Select ReducedAggreagtor Column

2017-10-19 Thread Russell Spitzer (JIRA)
Russell Spitzer created SPARK-22316:
---

 Summary: Cannot Select ReducedAggreagtor Column
 Key: SPARK-22316
 URL: https://issues.apache.org/jira/browse/SPARK-22316
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Russell Spitzer
Priority: Minor


Given a dataset which has been run through reduceGroups like this
{code}
case class Person(name: String, age: Int)
case class Customer(id: Int, person: Person)
val ds = spark.createDataFrame(Seq(Customer(1,Person("russ", 85)))
val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
{code}

We end up with a Dataset with the schema

{code}
 org.apache.spark.sql.types.StructType = 
StructType(
  StructField(value,IntegerType,false), 
  StructField(ReduceAggregator(Customer),
StructType(StructField(id,IntegerType,false),
StructField(person,
  StructType(StructField(name,StringType,true),
  StructField(age,IntegerType,false))
   ,true))
,true))
{code}

The column names are 
{code}
Array(value, ReduceAggregator(Customer))
{code}

But you cannot select the "ReduceAggregator(Customer)" column
{code}
grouped.select(grouped.columns(1))
org.apache.spark.sql.AnalysisException: cannot resolve 
'`ReduceAggregator(Customer)`' given input columns: [value, 
ReduceAggregator(Customer)];;
'Project ['ReduceAggregator(Customer)]
+- Aggregate [value#338], [value#338, 
reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
Some(newInstance(class Customer)), Some(class Customer), 
Some(StructType(StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 AS 
value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) null 
else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).id AS id#195, person, if 
(isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, true]._2)).person)) 
null else named_struct(name, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).person).name, true), age, 
assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person)) null else named_struct(name, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person).name, true), age, 
assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
true])).person).age) AS person#196, StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true), true, 0, 0) AS 
ReduceAggregator(Customer)#346]
   +- AppendColumns , class Customer, 
[StructField(id,IntegerType,false), 
StructField(person,StructType(StructField(name,StringType,true), 
StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
[input[0, int, false] AS value#338]
  +- LocalRelation [id#197, person#198]
{code}

You can work around this by using "toDF" to rename the column

{code}
scala> grouped.toDF("key", "reduced").select("reduced")
res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
{code}

I think that all invocations of 
{code}
ds.select(ds.columns(i))
{code}
should work for all valid i < columns.size.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211497#comment-16211497
 ] 

Sean Owen commented on SPARK-22314:
---

If it's on the classpath, it's visible to the user. These are just two ways to
get a UDF onto a classpath. I don't think this relates to what you're trying to
accomplish.

> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR 'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22306:
--
Summary: INFER_AND_SAVE overwrites important metadata in Parquet Metastore 
table  (was: INFER_AND_SAVE overwrites important metadata in Metastore)

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities

2017-10-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211450#comment-16211450
 ] 

Apache Spark commented on SPARK-20393:
--

User 'ambauma' has created a pull request for this issue:
https://github.com/apache/spark/pull/19538

> Strengthen Spark to prevent XSS vulnerabilities
> ---
>
> Key: SPARK-20393
> URL: https://issues.apache.org/jira/browse/SPARK-20393
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 2.0.2, 2.1.0
>Reporter: Nicholas Marion
>Assignee: Nicholas Marion
>  Labels: security
> Fix For: 2.1.2, 2.2.0
>
>
> Using IBM Security AppScan Standard, we discovered several easy to recreate 
> MHTML cross site scripting vulnerabilities in the Apache Spark Web GUI 
> application and these vulnerabilities were found to exist in Spark version 
> 1.5.2 and 2.0.2, the two levels we initially tested. Cross-site scripting 
> attack is not really an attack on the Spark server as much as an attack on 
> the end user, taking advantage of their trust in the Spark server to get them 
> to click on a URL like the ones in the examples below.  So whether the user 
> could or could not change lots of stuff on the Spark server is not the key 
> point.  It is an attack on the user themselves.  If they click the link the 
> script could run in their browser and compromise their device.  Once the 
> browser is compromised it could submit Spark requests but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/
> {quote}
> Request: GET 
> /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
> _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> HTTP/1.1
> Excerpt from response: No running application with ID 
> Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> 
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET 
> /history/app-20161012202114-0038/stages/stage?id=1=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> k.pageSize=100 HTTP/1.1
> Excerpt from response: Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET /log?appId=app-20170113131903-=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> eLength=0 HTTP/1.1
> Excerpt from response:  Bytes 0-0 of 0 of 
> /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content-
> Type: multipart/related; boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> {quote}
> security@apache was notified and recommended a PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Matyas Orhidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matyas Orhidi updated SPARK-22314:
--
Issue Type: Improvement  (was: Bug)

> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR 'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22315) Check for version match between R package and JVM

2017-10-19 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-22315:
-

 Summary: Check for version match between R package and JVM
 Key: SPARK-22315
 URL: https://issues.apache.org/jira/browse/SPARK-22315
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.1
Reporter: Shivaram Venkataraman


With the release of SparkR on CRAN, we could have scenarios where users have a
newer version of the package than the Spark cluster they are connecting to.

We should print appropriate warnings when either (a) connecting to an R backend
of a different version, or (b) connecting to a Spark master running a different
version of Spark (this should ideally happen inside Scala?).
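
A rough sketch of what the JVM-side half of such a check could look like; the
helper below is hypothetical and only illustrates the comparison against the
running Spark version:
{code}
// Hypothetical helper: warn when a client-reported SparkR package version does
// not match the Spark version of the JVM it connected to.
import org.apache.spark.SPARK_VERSION

def warnOnVersionMismatch(rPackageVersion: String): Unit = {
  if (rPackageVersion != SPARK_VERSION) {
    println(s"WARNING: SparkR package version $rPackageVersion does not match " +
      s"Spark JVM version $SPARK_VERSION")
  }
}
{code}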



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-19 Thread David Malinge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211433#comment-16211433
 ] 

David Malinge commented on SPARK-22306:
---

With:
spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE
spark.sql.hive.convertMetastoreParquet=false

The issue does not happen; the bucketing, sorting, and owner metadata are left as is.
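
For reference, a minimal sketch of that combination, assuming both properties can
be set as runtime SQL confs on an existing session:
{code}
// Keep INFER_AND_SAVE but read Parquet Metastore tables through the Hive serde
// path instead of Spark's native Parquet reader; per the report above, this
// avoids the metadata rewrite.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

spark.sql("SELECT * FROM t").show()
{code}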

> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Matyas Orhidi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211434#comment-16211434
 ] 

Matyas Orhidi commented on SPARK-22314:
---

There are use cases where the implementation should be hidden from the end user,
e.g. when the UDF contains sensitive code such as application-level encryption.

> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR 'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211422#comment-16211422
 ] 

Sean Owen commented on SPARK-22314:
---

What is the problem here? Either way you must specify the JAR location. In the
first case you use a Hive mechanism that isn't what's used in Spark.

> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR 'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Matyas Orhidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matyas Orhidi updated SPARK-22314:
--
Description: 
When defining UDF functions in Hive it is possible to load the UDF jar(s) from 
a shared location e.g. from hive.reloadable.aux.jars.path, and then use the 
CREATE FUNCTION statement:

{{CREATE FUNCTION  AS '';}}

These UDFs are not working from Spark unless you use the 

{{CREATE FUNCTION  AS '' USING 
JAR
  'hdfs:///';}}
command to create the Hive UDF function.


  was:
When defining UDF functions in Hive it is possible to load the UDF jar(s) from 
a shared location e.g. from hive.reloadable.aux.jars.path, and then use the 
CREATE FUNCTION statement:

{{CREATE FUNCTION  AS '';}}

These UDFs are not working from Spark unless you use the 
{{CREATE FUNCTION  AS '' USING 
JAR
  'hdfs:///';}}
command to create the Hive UDF function.



> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR
>   'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Matyas Orhidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matyas Orhidi updated SPARK-22314:
--
Description: 
When defining UDF functions in Hive it is possible to load the UDF jar(s) from 
a shared location e.g. from hive.reloadable.aux.jars.path, and then use the 
CREATE FUNCTION statement:

{{CREATE FUNCTION  AS '';}}

These UDFs are not working from Spark unless you use the 

{{CREATE FUNCTION  AS '' USING 
JAR 'hdfs:///';}}
command to create the Hive UDF function.


  was:
When defining UDF functions in Hive it is possible to load the UDF jar(s) from 
a shared location e.g. from hive.reloadable.aux.jars.path, and then use the 
CREATE FUNCTION statement:

{{CREATE FUNCTION  AS '';}}

These UDFs are not working from Spark unless you use the 

{{CREATE FUNCTION  AS '' USING 
JAR
  'hdfs:///';}}
command to create the Hive UDF function.



> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR 'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22314) Accessing Hive UDFs defined without 'USING JAR' from Spark

2017-10-19 Thread Matyas Orhidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matyas Orhidi updated SPARK-22314:
--
Summary: Accessing Hive UDFs defined without 'USING JAR' from Spark   (was: 
Accessing Hive UDFs from Spark originally defined without 'USING JAR')

> Accessing Hive UDFs defined without 'USING JAR' from Spark 
> ---
>
> Key: SPARK-22314
> URL: https://issues.apache.org/jira/browse/SPARK-22314
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Matyas Orhidi
>
> When defining UDF functions in Hive it is possible to load the UDF jar(s) 
> from a shared location e.g. from hive.reloadable.aux.jars.path, and then use 
> the CREATE FUNCTION statement:
> {{CREATE FUNCTION  AS '';}}
> These UDFs are not working from Spark unless you use the 
> {{CREATE FUNCTION  AS '' 
> USING JAR
>   'hdfs:///';}}
> command to create the Hive UDF function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22314) Accessing Hive UDFs from Spark originally defined without 'USING JAR'

2017-10-19 Thread Matyas Orhidi (JIRA)
Matyas Orhidi created SPARK-22314:
-

 Summary: Accessing Hive UDFs from Spark originally defined without 
'USING JAR'
 Key: SPARK-22314
 URL: https://issues.apache.org/jira/browse/SPARK-22314
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Matyas Orhidi


When defining UDF functions in Hive it is possible to load the UDF jar(s) from 
a shared location e.g. from hive.reloadable.aux.jars.path, and then use the 
CREATE FUNCTION statement:

{{CREATE FUNCTION  AS '';}}

These UDFs are not working from Spark unless you use the 
{{CREATE FUNCTION  AS '' USING 
JAR
  'hdfs:///';}}
command to create the Hive UDF function.
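
For illustration, a sketch of the two statements being compared, with hypothetical
function, class, and jar names:
{code}
// Hypothetical names throughout. The first form assumes the jar is already on
// Hive's auxiliary classpath; only the second form tells Spark where to fetch
// the jar from.
spark.sql("CREATE FUNCTION my_udf AS 'com.example.udf.MyUDF'")

spark.sql(
  "CREATE FUNCTION my_udf AS 'com.example.udf.MyUDF' " +
  "USING JAR 'hdfs:///udfs/my-udf.jar'")
{code}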




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211375#comment-16211375
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

Hi, [~WhoisDavid].
Spark 2.3 seems to be okay due to SPARK-17729.
I can confirm that the bucket spec is preserved, e.g. `Some(10 buckets, bucket
columns: [a, b], sort columns: [a, b])`.
You can download the snapshot and try it.
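
A quick way to verify this on the snapshot, reusing the DESC FORMATTED check from
the description:
{code}
// On the 2.3 snapshot, the bucketing information should now appear in Spark's
// own DESC FORMATTED output instead of being dropped.
spark.sql("DESC FORMATTED t").show(100, false)
{code}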

> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

2017-10-19 Thread Andrey Yakovenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211354#comment-16211354
 ] 

Andrey Yakovenko commented on SPARK-22307:
--

Id is not null for all records, but Parent.Id and Parent.Parent.Id could be null
for some records. I was expecting that when Parent is null, the evaluation of
Parent.Id IN (something) is null => false, and then not(Parent.Id IN
(something)) => true. I'm not a guru in SQL standards, so you are probably right.
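
A small sketch (not from this ticket) of the three-valued logic in question, using
a toy DataFrame with a nullable column: under standard SQL semantics NOT(NULL) is
still NULL, so such rows are dropped by both expr and not(expr), while the
when/otherwise workaround maps NULL to false before negating:
{code}
import org.apache.spark.sql.functions.{expr, not, when}
import spark.implicits._  // `spark` is the SparkSession, as in spark-shell

val df = Seq((Some("13MXIIAA4"), "a"), (None: Option[String], "b"))
  .toDF("parentId", "name")
val cond = expr("parentId IN ('13MXIIAA4', '13MXIBAA4')")

df.filter(cond).count()                               // 1: only the matching row
df.filter(not(cond)).count()                          // 0: NULL is neither true nor false
df.filter(when(cond, false).otherwise(true)).count()  // 1: null row kept, NULL treated as "not matched"
{code}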

> NOT condition working incorrectly
> -
>
> Key: SPARK-22307
> URL: https://issues.apache.org/jira/browse/SPARK-22307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Andrey Yakovenko
> Attachments: Catalog.json.gz
>
>
> Suggested test case: a table with x records filtered by expression expr returns y 
> records (< x), but not(expr) does not return x-y records. Workaround: 
> when(expr, false).otherwise(true) works fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  ||-- Id: string (nullable = true)
>  ||-- Name: string (nullable = true)
>  ||-- Parent: struct (nullable = true)
>  |||-- Id: string (nullable = true)
>  |||-- Name: string (nullable = true)
>  |||-- Parent: struct (nullable = true)
>  ||||-- Id: string (nullable = true)
>  ||||-- Name: string (nullable = true)
>  ||||-- Parent: string (nullable = true)
>  ||||-- SKU: string (nullable = true)
>  |||-- SKU: string (nullable = true)
>  ||-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
> ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
> '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR 
> (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
> 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> The table has a hierarchy-like structure, with records filled to different numbers 
> of levels. I suspect that, due to the partly filled hierarchy, the condition 
> sometimes evaluates to null/undefined (neither true nor false).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211353#comment-16211353
 ] 

Kazuaki Ishizaki commented on SPARK-22284:
--

I found [this JIRA|https://issues.apache.org/jira/browse/SPARK-18207] and 
realized that I had created the PR for it.
Judging from the attached code, I imagine that this case uses more complicated 
columns in a row.

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes are until Spark 2.1.0 so I don't know if running it on 
> Spark 2.2.0 would fix it. In any case I cannot change the version of Spark 
> since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211315#comment-16211315
 ] 

Kazuaki Ishizaki commented on SPARK-22284:
--

Thank you for uploading the generated code; it is very helpful. I now understand 
that the huge method is generated to calculate hash values over the many 
String columns inside structs and arrays.
If I remember correctly, this problem has not been fixed in 2.2 or master.

I will try to create a repro and submit a PR to hopefully fix this.
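
A hedged sketch of the kind of shape that tends to hit this limit (the column 
count and names are made up for illustration, not taken from the reporter's job):

{code:scala}
import org.apache.spark.sql.functions.{lit, struct}

// One struct column holding many String fields: hashing it (for
// dropDuplicates, joins, etc.) can generate a single oversized method.
val manyStrings = (1 to 300).map(i => lit(s"v$i").as(s"s$i"))
val df = spark.range(10).withColumn("nested", struct(manyStrings: _*))
df.dropDuplicates().count()  // may fail with "grows beyond 64 KB" on 2.1.x
{code}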

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes are until Spark 2.1.0 so I don't know if running it on 
> Spark 2.2.0 would fix it. In any case I cannot change the version of Spark 
> since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211301#comment-16211301
 ] 

Sean Owen commented on SPARK-22308:
---

Sure, but aren't you saying that the ship has sailed and someone already pulled 
this out into a reusable library? I don't think we'd maintain two. I'm unclear 
if you're saying it already has what you want, just separately, or whether it's 
not what you want. If it's the thing you're trying to create, why not use it?

Even if unsupported, I think you're welcome to try to use the Spark framework 
if you want. If there's a small tweak that makes it a lot more usable, OK. But 
documenting it as a supported API is a step too far I think.

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their scalatests - basing on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Nathan Kronenfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211291#comment-16211291
 ] 

Nathan Kronenfeld commented on SPARK-22308:
---

That's a perfect example of what I'm talking about.

Take a look at 
https://github.com/holdenk/spark-testing-base/blob/master/src/main/2.0/scala/com/holdenkarau/spark/testing/SharedSparkContext.scala
 - it's essentially a copy of SharedSparkContext in the Spark code base.

Which isn't even necessary now, as SharedSparkContext already _is_ published.

If you take a look at my associated PR, it barely touches SharedSparkContext at 
all - that is already pretty much fine. Even SharedSQLContext is mostly fine.

All it's doing is pulling apart SharedSQLContext into the part that needs to be 
a FunSuite and the part that just needs to be a Suite (into 
SharedSessionContext) so that it can be used with other styles of tests.

In other words, we already are publishing this stuff, I'm just trying to make 
it more usable.
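
For context, the kind of test external projects want to write today looks roughly 
like the sketch below; a Suite-level shared-session trait would replace the 
hand-rolled setup/teardown (class and test names are illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}

class WordCountSpec extends FlatSpec with Matchers with BeforeAndAfterAll {
  private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
  }

  override def afterAll(): Unit = {
    if (spark != null) spark.stop()
  }

  "a DataFrame" should "count its rows" in {
    spark.range(5).count() shouldBe 5L
  }
}
{code}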

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their scalatests - basing on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211276#comment-16211276
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

Thank you for confirming.
Sorry for asking again, but could you also try with 
`spark.sql.hive.convertMetastoreParquet=false`?

> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-19 Thread David Malinge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Malinge updated SPARK-22306:
--
Environment: 
Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
Spark 2.2.0

  was:
Hive 2.3.0 (PostgresQL metastore)
Spark 2.2.0


> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211239#comment-16211239
 ] 

Marco Gaido commented on SPARK-22307:
-

Have you checked whether the missing records contain null as the value of `col1`? If 
so, there is no bug and this is expected behavior according to the SQL standard: 
operations involving nulls evaluate to null, and null is treated as false in 
conditions. Thus nulls are correctly filtered out in both cases.
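
A quick way to see the three-valued logic at work (a minimal sketch; the literal 
values are placeholders):

{code:scala}
// NULL IN (...) evaluates to NULL, and NOT NULL is still NULL,
// so neither the filter nor its negation keeps such rows.
spark.sql("SELECT null IN ('a') AS in_result, NOT (null IN ('a')) AS not_result").show()
// both columns come back as null
{code}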

> NOT condition working incorrectly
> -
>
> Key: SPARK-22307
> URL: https://issues.apache.org/jira/browse/SPARK-22307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Andrey Yakovenko
> Attachments: Catalog.json.gz
>
>
> Suggested test case: a table with x records filtered by expression expr returns y 
> records (< x), but not(expr) does not return x-y records. Workaround: 
> when(expr, false).otherwise(true) works fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  ||-- Id: string (nullable = true)
>  ||-- Name: string (nullable = true)
>  ||-- Parent: struct (nullable = true)
>  |||-- Id: string (nullable = true)
>  |||-- Name: string (nullable = true)
>  |||-- Parent: struct (nullable = true)
>  ||||-- Id: string (nullable = true)
>  ||||-- Name: string (nullable = true)
>  ||||-- Parent: string (nullable = true)
>  ||||-- SKU: string (nullable = true)
>  |||-- SKU: string (nullable = true)
>  ||-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
> ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
> '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR 
> (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
> 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> The table has a hierarchy-like structure, with records filled to different numbers 
> of levels. I suspect that, due to the partly filled hierarchy, the condition 
> sometimes evaluates to null/undefined (neither true nor false).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211228#comment-16211228
 ] 

Sean Owen commented on SPARK-22308:
---

There are purpose-built Spark testing frameworks that seem designed for what 
you're doing, such as 
https://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
Spark's internal tests are not designed for use in this way, and it's not a 
goal to expose them in that way.

Use spark-testing-base?

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their scalatests - basing on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Nathan Kronenfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211207#comment-16211207
 ] 

Nathan Kronenfeld commented on SPARK-22308:
---

I can't disagree more strongly.

What you get then is people just copy-pasting this code into their own code 
base.

This is what we've been doing for years. We tried a number of iterations of our 
own test harness that kept failing for one reason or another, until we finally 
copied what Spark core was doing, and it all worked.

I was excited to learn that the test jars were finally being published, so we 
could stop keeping our own copy of SharedSparkContext. But now we either can't 
use SharedSQLContext, or have to make our tests conform to Spark's patterns.

To say to the Spark community, "sure, you can use our stuff, but you're on your 
own for testing" is, well, not helpful in the least. Testing is an integral 
part of using any product.

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their scalatests - basing on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-19 Thread Cheburakshu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211200#comment-16211200
 ] 

Cheburakshu commented on SPARK-22295:
-

But I had one problem with the Chi Square selector on this data set: it does not 
choose BMI ('mass' in the dataset) as a prominent feature, and I don't know why.




> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector is not recognizing the field 'class', which is present in 
> the data frame, while fitting the model. I am using the PIMA Indians diabetes 
> dataset from UCI. The complete code and output are below for reference. However, 
> when a few rows of the input file are turned into a dataframe manually, it 
> works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol='class').fit(df)
> except:
> print(sys.exc_info())
> {code}
> Output:
> ++-+-+-+-+-+-++--+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> ++-+-+-+-+-+-++--+
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
> ++-+-+-+-+-+-++--+
> only showing top 1 row
> ++-+-+-+-+-+-++--++
> |preg| plas| pres| skin| test| mass| pedi| age| class|features|
> ++-+-+-+-+-+-++--++
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
> ++-+-+-+-+-+-++--++
> only showing top 1 row
> (, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This 
> will work, but not the frame picked up from the file
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
> "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>

[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-19 Thread Cheburakshu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211199#comment-16211199
 ] 

Cheburakshu commented on SPARK-22295:
-

Yes. I realized that and marked my ticket as "Invalid"




> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector is not recognizing the field 'class', which is present in 
> the data frame, while fitting the model. I am using the PIMA Indians diabetes 
> dataset from UCI. The complete code and output are below for reference. However, 
> when a few rows of the input file are turned into a dataframe manually, it 
> works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol='class').fit(df)
> except:
> print(sys.exc_info())
> {code}
> Output:
> ++-+-+-+-+-+-++--+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> ++-+-+-+-+-+-++--+
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
> ++-+-+-+-+-+-++--+
> only showing top 1 row
> ++-+-+-+-+-+-++--++
> |preg| plas| pres| skin| test| mass| pedi| age| class|features|
> ++-+-+-+-+-+-++--++
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
> ++-+-+-+-+-+-++--++
> only showing top 1 row
> (, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This 
> will work, but not the frame picked up from the file
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
> "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol="class").fit(df)
> except:
> 

[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application

2017-10-19 Thread Kun Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211183#comment-16211183
 ] 

Kun Liu commented on SPARK-20153:
-

Steve: AFAIK, "Amazon EMR does not currently support use of the Apache Hadoop 
S3A file system."
https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/ 
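
As an aside on the original request: where S3A is available (Hadoop 2.8+, outside 
EMR), per-bucket configuration lets each bucket carry its own credentials, which 
may cover the multiple-credential case. A minimal sketch, with bucket names and 
environment variables as placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession

// fs.s3a.bucket.<bucket>.<option> overrides the global fs.s3a.<option>
// for that bucket only (per-bucket configuration, Hadoop 2.8+).
val spark = SparkSession.builder()
  .appName("multi-credential-example")
  .config("spark.hadoop.fs.s3a.bucket.team-a-data.access.key", sys.env("TEAM_A_KEY"))
  .config("spark.hadoop.fs.s3a.bucket.team-a-data.secret.key", sys.env("TEAM_A_SECRET"))
  .config("spark.hadoop.fs.s3a.bucket.team-b-data.access.key", sys.env("TEAM_B_KEY"))
  .config("spark.hadoop.fs.s3a.bucket.team-b-data.secret.key", sys.env("TEAM_B_SECRET"))
  .getOrCreate()
{code}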

> Support Multiple aws credentials in order to access multiple Hive on S3 table 
> in spark application 
> ---
>
> Key: SPARK-20153
> URL: https://issues.apache.org/jira/browse/SPARK-20153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Franck Tago
>Priority: Minor
>
> I need to access multiple hive tables in my spark application where each hive 
> table is 
> 1- an external table with data sitting on S3
> 2- each table is own by a different AWS user so I need to provide different 
> AWS credentials. 
> I am familiar with setting the aws credentials in the hadoop configuration 
> object but that does not really help me because I can only set one pair of 
> (fs.s3a.awsAccessKeyId , fs.s3a.awsSecretAccessKey )
> From my research , there is no easy or elegant way to do this in spark .
> Why is that ?  
> How do I address this use case?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-19 Thread Ben (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben updated SPARK-22284:

Attachment: 64KB Error.log

Sure, I just added it as an attachment.

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes are until Spark 2.1.0 so I don't know if running it on 
> Spark 2.2.0 would fix it. In any case I cannot change the version of Spark 
> since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-19 Thread David Malinge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211129#comment-16211129
 ] 

David Malinge edited comment on SPARK-22306 at 10/19/17 2:32 PM:
-

[~dongjoon]
Seems to be set to false by default (or at least in my current setup)


was (Author: whoisdavid):
[~dongjoon]
spark.sql.hive.convertMetastoreParquet.mergeSchema=false
Seems to be set to false by default (or at least in my current setup)

> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-19 Thread David Malinge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211129#comment-16211129
 ] 

David Malinge commented on SPARK-22306:
---

[~dongjoon]
spark.sql.hive.convertMetastoreParquet.mergeSchema=false
Seems to be set to false by default (or at least in my current setup)

> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine

2017-10-19 Thread Yuval Degani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211038#comment-16211038
 ] 

Yuval Degani commented on SPARK-9:
--

[~rvesse], thanks for taking the time to review.

Regarding performance testing: we use a suite of common benchmarks in our 
day-to-day work (HiBench, TPC-DS, etc.), as well as several customer applications.
We see anywhere from 5% to 120% speedups in runtime, depending on the type 
and size of the workload.
In terms of scale, we test multiple variations in the range of 2 to 128 
physical machines, and with different link speeds (10-100 Gbps).

Regarding compression, we share the same experience with setting 
{{spark.shuffle.compress=false}}. It causes TCP/IP to perform better in most 
cases, and since the uncompressed shuffle is significantly larger, RDMA shows an 
even bigger advantage over TCP/IP in that case.
For example, in TeraSort, when comparing TCP/IP to RDMA with 
{{spark.shuffle.compress=false}}, we see about a 45% speedup in total 
runtime. Running the same test with {{spark.shuffle.compress=true}} yields 
around a 20% speedup for RDMA, as the shuffle size shrinks significantly.
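
For readers who want to reproduce the comparison, a hedged sketch of the two 
configurations (the RDMA shuffle-manager class name is taken from the SparkRDMA 
project's documentation and is an assumption here, not something defined in this 
ticket):

{code:scala}
import org.apache.spark.SparkConf

// Baseline TCP/IP run: only compression is toggled for the experiment.
val baseline = new SparkConf().set("spark.shuffle.compress", "false")

// RDMA run: identical settings plus the plugin's shuffle manager (assumed name).
val rdma = new SparkConf()
  .set("spark.shuffle.compress", "false")
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
{code}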

Regarding licensing of the RDMA dependencies, I'll try to drill down properly 
into the issues raised.
In general, {{librdmacm}} is part of the Linux source code, and is not 
statically linked with {{libdisni}}. I'm not a licensing expert in any way, but 
I presume that other Apache projects depend on Linux libraries in various 
capacities.
There's at least one Apache project that I'm aware of that depends on 
{{librdmacm}} - [Apache Qpid|http://qpid.apache.org/] - so I think this 
constitutes a precedent for having a dependency on {{librdmacm}}.


> SPIP: RDMA Accelerated Shuffle Engine
> -
>
> Key: SPARK-9
> URL: https://issues.apache.org/jira/browse/SPARK-9
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuval Degani
> Attachments: 
> SPARK-9_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits 
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
> processing overhead by bypassing the kernel and networking stack as well as 
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed 
> directly by the actual Spark workloads, and help reducing the job runtime 
> significantly. 
> This performance gain is demonstrated with both industry standard HiBench 
> TeraSort (shows 1.5x speedup in sorting) as well as shuffle intensive 
> customer applications. 
> SparkRDMA will be presented at Spark Summit 2017 in Dublin 
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer

2017-10-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210964#comment-16210964
 ] 

Kazuaki Ishizaki commented on SPARK-22254:
--

If no one has started working on this, I will work on it.
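
To make the discussion concrete, a rough sketch of the kind of simplification 
being suggested; this is not the actual CompactBuffer code, just an illustration 
of capping growth at the maximum safe array length:

{code:scala}
import org.apache.spark.unsafe.array.ByteArrayMethods

// Double the capacity, but never exceed MAX_ROUNDED_ARRAY_LENGTH.
def newCapacity(currentCapacity: Int, requiredSize: Int): Int = {
  val arrayMax = ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
  require(requiredSize <= arrayMax, s"Cannot grow buffer beyond $arrayMax elements")
  math.min(math.max(currentCapacity * 2L, requiredSize.toLong), arrayMax.toLong).toInt
}
{code}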

> clean up the implementation of `growToSize` in CompactBuffer
> 
>
> Key: SPARK-22254
> URL: https://issues.apache.org/jira/browse/SPARK-22254
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Feng Liu
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used 
> previously. We should make the calculation of newArrayLen concise.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210943#comment-16210943
 ] 

Kazuaki Ishizaki commented on SPARK-22284:
--

Can you attach the generated code? It may help identify which 
`SpecificUnsafeProjection` causes this exception.

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes are until Spark 2.1.0 so I don't know if running it on 
> Spark 2.2.0 would fix it. In any case I cannot change the version of Spark 
> since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210898#comment-16210898
 ] 

Marco Gaido edited comment on SPARK-20617 at 10/19/17 11:43 AM:


This is not a bug. This is the correct and expected behavior according to the SQL 
standard. Indeed, every operation involving null is evaluated to null. You 
can easily check this behavior by running:

{code:java}
spark.sql("select null in ('a')")
{code}

Then, in a filter expression, null is considered to be false. So the behavior you 
see is the correct one. Your first "workaround" is the right way to go.
Thanks.


was (Author: mgaido):
This is not a bug. This is the right and expected behavior according to SQL 
standards. Indeed, every operation involving null, is evaluated to null. You 
can easily check this behavior running:

{code:java}
// Some comments here
spark.sql("select null in ('a')")
{code}

Then, in a filter expression null is considered to be false. So you have this 
behavior which is the right one. Your first "workaround" is the right way to go.
Thanks.

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below is an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210898#comment-16210898
 ] 

Marco Gaido edited comment on SPARK-20617 at 10/19/17 11:43 AM:


This is not a bug. This is the correct and expected behavior according to the SQL 
standard. Indeed, every operation involving null is evaluated to null. You 
can easily check this behavior by running:

{code:java}
// Some comments here
spark.sql("select null in ('a')")
{code}

Then, in a filter expression, null is considered to be false. So the behavior you 
see is the correct one. Your first "workaround" is the right way to go.
Thanks.


was (Author: mgaido):
This is not a bug. This is the right and expected behavior according to SQL 
standards. Indeed, every operation involving null, is evaluated to null. You 
can easily check this behavior running:
```
spark.sql("select null in ('a')")
```
Then, in a filter expression null is considered to be false. So you have this 
behavior which is the right one. Your first "workaround" is the right way to go.
Thanks.

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below is an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-20617.
-
Resolution: Not A Bug

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below is an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210898#comment-16210898
 ] 

Marco Gaido commented on SPARK-20617:
-

This is not a bug; it is the right and expected behavior according to the SQL 
standard. Every comparison involving null evaluates to null. You can easily 
check this by running:
{code}
spark.sql("select null in ('a')").show()
{code}
Then, in a filter expression, null is treated as false, which is why the rows 
with a null {{col1}} are dropped. Your first "workaround" is the right way to go.
Thanks.
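
As an illustration of the three-valued logic described above, here is a minimal, 
self-contained PySpark sketch; the data and session name are made up for the 
example and are not taken from the original report:
{code}
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.master("local[2]").appName("null-isin-demo").getOrCreate()

df = spark.createDataFrame(
    [(None, 0), (None, 1), ("a", 2), ("b", 3)],
    ["col1", "col2"])

# NOT IN over a null value evaluates to null, and filter() drops rows whose
# predicate is null, so the rows with col1 = null disappear:
df.filter(~sf.col("col1").isin(["a"])).show()

# Keep the null rows explicitly, as in the first workaround above:
df.filter((~sf.col("col1").isin(["a"])) | sf.col("col1").isNull()).show()

spark.stop()
{code}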

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below is an example to replicate it:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"]) == False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   c|   4|
>  |   b|   3|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6685) Use DSYRK to compute AtA in ALS

2017-10-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210888#comment-16210888
 ] 

Apache Spark commented on SPARK-6685:
-

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19536

> Use DSYRK to compute AtA in ALS
> ---
>
> Key: SPARK-6685
> URL: https://issues.apache.org/jira/browse/SPARK-6685
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We 
> should switch to DSYRK to use native BLAS to accelerate the computation. The 
> factors should remain dense vectors. And we can pre-allocate a buffer to 
> stack vectors and do Level 3. This requires some benchmark to demonstrate the 
> improvement.
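
As a rough illustration of the Level 2 vs. Level 3 distinction described above 
(NumPy only; MLlib itself calls the native BLAS routines, and the sizes below 
are arbitrary):
{code}
import numpy as np

rank, n_vectors = 10, 1000
factors = np.random.rand(n_vectors, rank)

# Level-2 style: one rank-1 update per factor vector, as DSPR does today.
ata_rank1 = np.zeros((rank, rank))
for v in factors:
    ata_rank1 += np.outer(v, v)

# Level-3 style: buffer the vectors and form the Gram matrix in a single call,
# which is what DSYRK would compute on the stacked block.
ata_block = factors.T.dot(factors)

assert np.allclose(ata_rank1, ata_block)
{code}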



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22309) Remove unused param in `LDAModel.getTopicDistributionMethod`

2017-10-19 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22309:
-
Description: 
1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
is redundant.


  was:
1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
is redundant.
2, forgot to destroy the broadcasted object {{nodeToFeaturesBc}} in {{RandomForest}}


> Remove unused param in `LDAModel.getTopicDistributionMethod`
> 
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22309) Remove unused param in `LDAModel.getTopicDistributionMethod`

2017-10-19 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22309:
-
Summary: Remove unused param in `LDAModel.getTopicDistributionMethod`  
(was: Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy 
`nodeToFeaturesBc` in RandomForest)

> Remove unused param in `LDAModel.getTopicDistributionMethod`
> 
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.
> 2, forgot to destroy the broadcasted object {{nodeToFeaturesBc}} in 
> {{RandomForest}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22309) Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy `nodeToFeaturesBc` in RandomForest

2017-10-19 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210838#comment-16210838
 ] 

zhengruifeng commented on SPARK-22309:
--

OK, I will separate them

> Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy 
> `nodeToFeaturesBc` in RandomForest
> -
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.
> 2, forgot to destroy the broadcasted object {{nodeToFeaturesBc}} in 
> {{RandomForest}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22313) Mark/print deprecation warnings as DeprecationWarning for deprecated APIs

2017-10-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22313:


Assignee: Apache Spark

> Mark/print deprecation warnings as DeprecationWarning for deprecated APIs
> -
>
> Key: SPARK-22313
> URL: https://issues.apache.org/jira/browse/SPARK-22313
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, some {{warnings.warn(...)}} calls for deprecation use the default 
> category {{UserWarning}}.
> If we use {{DeprecationWarning}}, this can actually be useful in IDEs (in my 
> case, PyCharm). Please see the before and after in the PR; I happened to open a 
> PR first to show my idea.
> Also, it looks like some deprecated functions do not emit these warnings. It 
> might be better to print those out explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22313) Mark/print deprecation warnings as DeprecationWarning for deprecated APIs

2017-10-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22313:


Assignee: (was: Apache Spark)

> Mark/print deprecation warnings as DeprecationWarning for deprecated APIs
> -
>
> Key: SPARK-22313
> URL: https://issues.apache.org/jira/browse/SPARK-22313
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, some {{warnings.warn(...)}} calls for deprecation use the default 
> category {{UserWarning}}.
> If we use {{DeprecationWarning}}, this can actually be useful in IDEs (in my 
> case, PyCharm). Please see the before and after in the PR; I happened to open a 
> PR first to show my idea.
> Also, it looks like some deprecated functions do not emit these warnings. It 
> might be better to print those out explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22313) Mark/print deprecation warnings as DeprecationWarning for deprecated APIs

2017-10-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210759#comment-16210759
 ] 

Apache Spark commented on SPARK-22313:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19535

> Mark/print deprecation warnings as DeprecationWarning for deprecated APIs
> -
>
> Key: SPARK-22313
> URL: https://issues.apache.org/jira/browse/SPARK-22313
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, some {{warnings.warn(...)}} calls for deprecation use the default 
> category {{UserWarning}}.
> If we use {{DeprecationWarning}}, this can actually be useful in IDEs (in my 
> case, PyCharm). Please see the before and after in the PR; I happened to open a 
> PR first to show my idea.
> Also, it looks like some deprecated functions do not emit these warnings. It 
> might be better to print those out explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22313) Mark/print deprecation warnings as DeprecationWarning for deprecated APIs

2017-10-19 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-22313:


 Summary: Mark/print deprecation warnings as DeprecationWarning for 
deprecated APIs
 Key: SPARK-22313
 URL: https://issues.apache.org/jira/browse/SPARK-22313
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, some {{warnings.warn(...)}} calls for deprecation use the default 
category {{UserWarning}}.

If we use {{DeprecationWarning}}, this can actually be useful in IDEs (in my 
case, PyCharm). Please see the before and after in the PR; I happened to open a 
PR first to show my idea.

Also, it looks like some deprecated functions do not emit these warnings. It 
might be better to print those out explicitly.
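
A small sketch of the category change being proposed; the wrapper function below 
is hypothetical and only illustrates the {{DeprecationWarning}} category, it is 
not the actual patch:
{code}
import warnings

def register_temp_table(df, name):
    """Hypothetical deprecated helper, shown only to illustrate the warning category."""
    warnings.warn(
        "register_temp_table is deprecated, use createOrReplaceTempView instead.",
        DeprecationWarning,  # instead of the default UserWarning
        stacklevel=2,        # point the warning at the caller, not at this helper
    )
    df.createOrReplaceTempView(name)

# Note: Python hides DeprecationWarning by default in most contexts; enable it
# during testing with warnings.simplefilter("always", DeprecationWarning).
{code}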



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-10-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14371:
-

Assignee: Valeriy Avanesov

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to 
> driver
> 
>
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Valeriy Avanesov
> Fix For: 2.3.0
>
>
> See this line: 
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each 
> document in the mini-batch.  Those are collected to the driver in this line:
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver.  Rather, we should do the 
> necessary maps and aggregations in a distributed manner.  This will involve 
> modifying the Dirichlet expectation implementation.  (This JIRA should be done 
> by someone knowledgeable about online LDA and Spark.)
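
A hedged sketch of the distributed-aggregation idea using the public RDD API 
(the per-document statistic here is just a stand-in, not LDAOptimizer's actual 
expectation step):
{code}
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[2]", "lda-agg-sketch")
k = 5  # stand-in for the number of topics

# Pretend each element is one document's per-topic statistic vector.
doc_stats = sc.parallelize([np.random.rand(k) for _ in range(1000)], 8)

# Instead of collect()-ing every per-document vector to the driver, sum them
# per partition and merge the partial sums on the way back.
total = doc_stats.treeAggregate(
    np.zeros(k),
    lambda acc, v: acc + v,  # fold one document's vector into the partial sum
    lambda a, b: a + b,      # merge partial sums from different partitions
    depth=2)
print(total)
sc.stop()
{code}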



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-10-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14371.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to 
> driver
> 
>
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Valeriy Avanesov
> Fix For: 2.3.0
>
>
> See this line: 
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each 
> document in the mini-batch.  Those are collected to the driver in this line:
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver.  Rather, we should do the 
> necessary maps and aggregations in a distributed manner.  This will involve 
> modifying the Dirichlet expectation implementation.  (This JIRA should be done 
> by someone knowledgeable about online LDA and Spark.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22188:
-

Assignee: Krishna Pandey

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Krishna Pandey
>Assignee: Krishna Pandey
>Priority: Minor
>  Labels: security
> Fix For: 2.3.0
>
>
> The HTTP response headers below can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website; however, the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options
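
The headers are easy to experiment with outside Spark; the sketch below uses 
Python's standard http.server purely as an illustration (it is not Spark's 
Jetty filter, and the max-age and port values are arbitrary examples):
{code}
from http.server import BaseHTTPRequestHandler, HTTPServer

class SecureHeadersHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        # Only meaningful when served over HTTPS; browsers ignore HSTS on plain HTTP.
        self.send_header("Strict-Transport-Security",
                         "max-age=31536000; includeSubDomains")
        self.send_header("X-XSS-Protection", "1; mode=block")
        self.send_header("X-Content-Type-Options", "nosniff")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8099), SecureHeadersHandler).serve_forever()
{code}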



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22188.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19419
[https://github.com/apache/spark/pull/19419]

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Krishna Pandey
>Priority: Minor
>  Labels: security
> Fix For: 2.3.0
>
>
> The HTTP response headers below can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website; however, the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210692#comment-16210692
 ] 

Sean Owen commented on SPARK-22308:
---

These are internal test classes and not any kind of API, so I don't think we 
should design, doc and expose them for that purpose. Other projects should 
create their own test harnesses or use existing Spark test harnesses.

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that have spark code can test it using SharedSparkContext, 
> no matter how they write their ScalaTest suites - whether based on FunSuite, 
> FunSpec, FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22288) Tricky interaction between closure-serialization and inheritance results in confusing failure

2017-10-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22288.
---
Resolution: Not A Problem

OK, fine to note here, but I think this is just a Java issue. You're right re: 
Kryo.

> Tricky interaction between closure-serialization and inheritance results in 
> confusing failure
> -
>
> Key: SPARK-22288
> URL: https://issues.apache.org/jira/browse/SPARK-22288
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> Documenting this since I've run into it a few times; [full repro / discussion 
> here|https://github.com/ryan-williams/spark-bugs/tree/serde].
> Given 3 possible super-classes:
> {code}
> class Super1(n: Int)
> class Super2(n: Int) extends Serializable
> class Super3
> {code}
> A subclass that passes a closure to an RDD operation (e.g. {{map}} or 
> {{filter}}), where the closure references one of the subclass's fields, will 
> throw an {{java.io.InvalidClassException: …; no valid constructor}} exception 
> when the subclass extends {{Super1}} but not {{Super2}} or {{Super3}}. 
> Referencing method-local variables (instead of fields) is fine in all cases:
> {code}
> class App extends Super1(4) with Serializable {
>   val s = "abc"
>   def run(): Unit = {
> val sc = new SparkContext(new SparkConf().set("spark.master", 
> "local[4]").set("spark.app.name", "serde-test"))
> try {
>   sc
> .parallelize(1 to 10)
> .filter(App.fn(_, s))  // danger! closure references `s`, crash 
> ensues
> .collect()  // driver stack-trace points here
> } finally {
>   sc.stop()
> }
>   }
> }
> object App {
>   def main(args: Array[String]): Unit = { new App().run() }
>   def fn(i: Int, s: String): Boolean = i % 2 == 0
> }
> {code}
> The task-failure stack trace looks like:
> {code}
> java.io.InvalidClassException: com.MyClass; no valid constructor
>   at 
> java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
>   at 
> java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1782)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
> {code}
> and a driver stack-trace will point to the first line that initiates a Spark 
> job that exercises the closure/RDD-operation in question.
> Not sure how much this should be considered a problem with Spark's 
> closure-serialization logic vs. Java serialization, but maybe if the former 
> gets looked at or improved (e.g. with 2.12 support), this kind of interaction 
> can be improved upon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22309) Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy `nodeToFeaturesBc` in RandomForest

2017-10-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22309:
--
Priority: Trivial  (was: Minor)

Let's not mix unrelated issues please

> Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy 
> `nodeToFeaturesBc` in RandomForest
> -
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.
> 2, forgot to destroy the broadcasted object {{nodeToFeaturesBc}} in 
> {{RandomForest}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22311) stage api modify the description format, add version api, modify the duration real-time calculation

2017-10-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22311.
---
Resolution: Won't Fix

This is too trivial for a JIRA, and one of the changes is wrong. See comments 
in the PR. Don't bundle unrelated items together like this.

> stage api modify the description format, add version api, modify the duration 
> real-time calculation
> ---
>
> Key: SPARK-22311
> URL: https://issues.apache.org/jira/browse/SPARK-22311
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> stage api: modify the description format.
>  A list of all stages for a given application.
>  ?status=[active|complete|pending|failed] lists only 
> stages in the given state.
> content should be included in  
> add version api: doc '/api/v1/version'
> modify the duration real-time calculation



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22283.
---
Resolution: Not A Problem

> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}
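
To see the ambiguity described above, here is a small PySpark reproduction (the 
data and aliases are made up) together with the usual way to avoid it:
{code}
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.master("local[2]").appName("withColumn-dup").getOrCreate()

left = spark.createDataFrame([(1, "x")], ["id", "a"]).alias("l")
right = spark.createDataFrame([(1, "y")], ["id", "b"]).alias("r")

# The join result carries two columns named 'id'; withColumn("id", ...) then
# matches both of them, and later references to "id" become ambiguous.
joined = left.join(right, sf.col("l.id") == sf.col("r.id"))

# Drop one of the duplicates first, then withColumn behaves as expected.
deduped = joined.drop(sf.col("r.id"))
deduped.withColumn("id", sf.col("id") + 1).show()

spark.stop()
{code}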



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22290) Starting second context in same JVM fails to get new Hive delegation token

2017-10-19 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-22290.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19509
[https://github.com/apache/spark/pull/19509]

> Starting second context in same JVM fails to get new Hive delegation token
> --
>
> Key: SPARK-22290
> URL: https://issues.apache.org/jira/browse/SPARK-22290
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> Consider the following pyspark script:
> {code}
> sc = SparkContext()
> # do stuff
> sc.stop()
> # do some other stuff
> sc = SparkContext()
> {code}
> That code didn't work at all in 2.2 (it failed to create the second 
> context), but it makes more progress in 2.3. However, it fails to create new Hive 
> delegation tokens; you see this error in the output:
> {noformat}
> 17/10/16 16:26:50 INFO security.HadoopFSDelegationTokenProvider: getting 
> token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1714191595_19, 
> ugi=blah(auth:KERBEROS)]]
> 17/10/16 16:26:50 INFO hive.metastore: Trying to connect to metastore with 
> URI blah
> 17/10/16 16:26:50 INFO hive.metastore: Connected to metastore.
> 17/10/16 16:26:50 ERROR metadata.Hive: MetaException(message:Delegation Token 
> can be issued only with kerberos authentication. Current 
> AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore
> {noformat}
> The error is printed in the logs but it doesn't cause the app to fail (which 
> might be considered wrong).
> The effect is that when that old delegation token expires the new app will 
> fail.
> But the real issue here is that Spark shouldn't be mixing delegation tokens 
> from different apps. It should try harder to isolate a set of delegation 
> tokens to a single app submission.
> And, in the case of Hive, there are many situations where a delegation token 
> isn't needed at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22290) Starting second context in same JVM fails to get new Hive delegation token

2017-10-19 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-22290:
---

Assignee: Marcelo Vanzin

> Starting second context in same JVM fails to get new Hive delegation token
> --
>
> Key: SPARK-22290
> URL: https://issues.apache.org/jira/browse/SPARK-22290
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> Consider the following pyspark script:
> {code}
> sc = SparkContext()
> # do stuff
> sc.stop()
> # do some other stuff
> sc = SparkContext()
> {code}
> That code didn't work at all in 2.2 (it failed to create the second 
> context), but it makes more progress in 2.3. However, it fails to create new Hive 
> delegation tokens; you see this error in the output:
> {noformat}
> 17/10/16 16:26:50 INFO security.HadoopFSDelegationTokenProvider: getting 
> token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1714191595_19, 
> ugi=blah(auth:KERBEROS)]]
> 17/10/16 16:26:50 INFO hive.metastore: Trying to connect to metastore with 
> URI blah
> 17/10/16 16:26:50 INFO hive.metastore: Connected to metastore.
> 17/10/16 16:26:50 ERROR metadata.Hive: MetaException(message:Delegation Token 
> can be issued only with kerberos authentication. Current 
> AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore
> {noformat}
> The error is printed in the logs but it doesn't cause the app to fail (which 
> might be considered wrong).
> The effect is that when that old delegation token expires the new app will 
> fail.
> But the real issue here is that Spark shouldn't be mixing delegation tokens 
> from different apps. It should try harder to isolate a set of delegation 
> tokens to a single app submission.
> And, in the case of Hive, there are many situations where a delegation token 
> isn't needed at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


