[jira] [Created] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
Reynold Xin created HIVE-16391:
----------------------------------

             Summary: Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
                 Key: HIVE-16391
                 URL: https://issues.apache.org/jira/browse/HIVE-16391
             Project: Hive
          Issue Type: Task
          Components: Build Infrastructure
            Reporter: Reynold Xin


Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the only change in the fork is a workaround for the fact that Hive publishes only two sets of jars: one set with no dependencies declared, and another with all dependencies bundled into a published uber jar. There is general consensus on both sides that we should remove the forked Hive.

The change in the forked version is recorded here: https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2

Note that the fork previously included other fixes, but those have all become unnecessary.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (HIVE-9362) Document API Gurantees
    [ https://issues.apache.org/jira/browse/HIVE-9362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308023#comment-14308023 ]

Reynold Xin commented on HIVE-9362:
-----------------------------------

It's great to see this ticket! It is an important step towards Hive becoming a platform and would be tremendously useful to Spark.

> Document API Gurantees
> ----------------------
>
>                 Key: HIVE-9362
>                 URL: https://issues.apache.org/jira/browse/HIVE-9362
>             Project: Hive
>          Issue Type: Task
>            Reporter: Brock Noland
>            Priority: Blocker
>             Fix For: 0.15.0
>
>
> This is an uber JIRA to document our API compatibility guarantees. Similar to Hadoop, I believe we should have [InterfaceAudience|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-annotations/src/main/java/org/apache/hadoop/classification/InterfaceAudience.java] and [InterfaceStability|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-annotations/src/main/java/org/apache/hadoop/classification/InterfaceStability.java] annotations, which I believe originally came from Sun.
> This project would be an effort by the Hive community, including other projects that depend on Hive APIs, to document which APIs they use. Although the APIs they use may not all be considered {{Stable}} or even {{Evolving}}, we'll at least have an idea of whom we are breaking when a change is made.
> Beyond the Java API there is the Thrift API. Many projects use the Thrift binding directly since we don't provide an API in, say, Python. As such, I'd suggest we consider the Thrift API to be {{Public}} and {{Stable}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
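For illustration, a minimal sketch of how the Hadoop-style annotations referenced above are applied. The class and method below are hypothetical; only the annotations themselves come from the hadoop-annotations module.

{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Hypothetical Hive-side class, shown only to illustrate how audience and
// stability would be declared on an API surface.
@InterfaceAudience.Public
@InterfaceStability.Stable
public class ExamplePublicApi {

    // An API that named downstream projects may call, but that is still allowed
    // to change between minor releases, would be tagged Evolving instead of Stable.
    @InterfaceAudience.LimitedPrivate({"Spark", "Pig"})
    @InterfaceStability.Evolving
    public String evolvingHelper() {
        return "subject to change";
    }
}
{code}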
[jira] [Updated] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]
     [ https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated HIVE-9410:
------------------------------
    Description: 
We have a Hive query case with a UDF defined (i.e. BigBench cases Q10, Q18, etc.). It passes in the default Hive (on MR) mode but fails in Hive on Spark mode (both Standalone and Yarn-Client). Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue still exists. BTW, if we put the UDF jar into the $HIVE_HOME/lib dir, the case passes.

The detailed error message is below. (NOTE: de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF contained in the jar bigbenchqueriesmr.jar, and we have explicitly added a command like 'add jar /location/to/bigbenchqueriesmr.jar;' to the .sql file.)
{code}
INFO [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d
org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: de.bankmark.bigbench.queries.q10.SentimentUDF
Serialization trace:
genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc)
conf (org.apache.hadoop.hive.ql.exec.UDTFOperator)
childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
right (org.apache.commons.lang3.tuple.ImmutablePair)
edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork)
	at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
	at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
	...
Caused by: java.lang.ClassNotFoundException: de.bankmark.bigbench.queries.q10.SentimentUDF
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
	... 55 more
{code}
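As context for the stack trace above: Kryo's DefaultClassResolver resolves class names via Class.forName against a classloader that never saw the session-added jar. The standalone sketch below (the jar path and class name are taken from the report and treated as hypothetical) reproduces that visibility gap outside of Hive and Spark.

{code}
import java.net.URL;
import java.net.URLClassLoader;

// Standalone sketch of the classloader visibility gap behind the
// ClassNotFoundException above. The jar path below is hypothetical.
public class ClassLoaderVisibilityDemo {
    public static void main(String[] args) throws Exception {
        URL udfJar = new URL("file:///location/to/bigbenchqueriesmr.jar");
        try (URLClassLoader sessionLoader = new URLClassLoader(new URL[]{udfJar})) {
            // Resolves when the jar is actually present at that path, because the
            // loader that contains it is passed explicitly.
            Class.forName("de.bankmark.bigbench.queries.q10.SentimentUDF", true, sessionLoader);

            // Fails with ClassNotFoundException even when the jar exists: the default
            // classloader never saw it, which is effectively what Kryo hits on the
            // remote Spark driver.
            Class.forName("de.bankmark.bigbench.queries.q10.SentimentUDF");
        }
    }
}
{code}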
[jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
    [ https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209327#comment-14209327 ]

Reynold Xin commented on HIVE-7333:
-----------------------------------

Don't think any changes are necessary in Spark. At the end of the day you can run arbitrary code on arbitrary records for each partition - using that alone should be sufficient to run vectorization. You can even put an entire partition of records into one iterator output ...

> Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-7333
>                 URL: https://issues.apache.org/jira/browse/HIVE-7333
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
>
> Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
    [ https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209315#comment-14209315 ]

Reynold Xin commented on HIVE-7333:
-----------------------------------

This is pretty trivial to solve. Each "row" in an RDD can be a batch of rows.

> Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-7333
>                 URL: https://issues.apache.org/jira/browse/HIVE-7333
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
>
> Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
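A minimal sketch of the batching idea from the two comments above: the per-partition function handed to something like RDD.mapPartitions can emit one list of rows (a batch) per N inputs, so each element of the resulting RDD is a batch rather than a single row. The helper below is illustrative plain Java, not actual Hive or Spark code.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative batching helper: turns an iterator of rows into an iterator of
// fixed-size row batches, the shape a vectorized operator would consume.
public class RowBatcher {

    static Iterator<List<Object[]>> batch(Iterator<Object[]> rows, int batchSize) {
        List<List<Object[]>> batches = new ArrayList<>();
        List<Object[]> current = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            current.add(rows.next());
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);   // emit the final, possibly partial batch
        }
        // Collected eagerly for simplicity; a real per-partition function would
        // usually wrap the input iterator lazily instead.
        return batches.iterator();
    }

    public static void main(String[] args) {
        Iterator<Object[]> rows = Arrays.<Object[]>asList(
                new Object[]{1, "a"}, new Object[]{2, "b"}, new Object[]{3, "c"}).iterator();
        batch(rows, 2).forEachRemaining(b -> System.out.println("batch of " + b.size() + " rows"));
    }
}
{code}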
[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing
    [ https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078939#comment-14078939 ]

Reynold Xin commented on HIVE-7334:
-----------------------------------

BTW, definitely look at https://github.com/apache/spark/pull/1499

> Create SparkShuffler, shuffling data between map-side data processing and
> reduce-side processing
> --------------------------------------------------------------------------
>
>                 Key: HIVE-7334
>                 URL: https://issues.apache.org/jira/browse/HIVE-7334
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>         Attachments: HIVE-7334.patch
>
> Please refer to the design spec.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Updated] (HIVE-7387) Guava version conflict between hadoop and spark [Spark-Branch]
     [ https://issues.apache.org/jira/browse/HIVE-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated HIVE-7387:
------------------------------
    Description: 
hadoop-hdfs and hadoop-common depend on guava-11.0.2.jar, while Spark depends on guava-14.0.1.jar. guava-11.0.2 has an API conflict with guava-14.0.1, and since the Hive CLI currently loads both dependencies into the classpath, queries fail on both the Spark engine and the MR engine.
{code}
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
	at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
	at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
	at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
	at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
	at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
	at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
	at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
	at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
	at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
	at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
	at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
	at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
	at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
	at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
	at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
	at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
	at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:527)
	at org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:307)
	at org.apache.hadoop.hive.ql.exec.spark.SparkClient.createRDD(SparkClient.java:204)
	at org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:167)
	at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:32)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:159)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
{code}
NO PRECOMMIT TESTS. This is for spark branch only.
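For reference, a minimal sketch of the API gap behind the NoSuchMethodError above: HashFunction.hashInt(int) is present in guava-14.0.1 (which Spark builds against) but not in guava-11.0.2, so code compiled against the newer Guava fails at runtime when the older jar is picked up first. The class name below is illustrative.

{code}
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;

// Illustrative reproduction of the Guava API gap: this compiles against
// guava-14.0.1 but throws NoSuchMethodError at runtime if guava-11.0.2
// (pulled in by hadoop-hdfs/hadoop-common) appears first on the classpath.
public class GuavaConflictDemo {
    public static void main(String[] args) {
        HashCode code = Hashing.murmur3_32().hashInt(42); // hashInt is absent in 11.0.2
        System.out.println(code);
    }
}
{code}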
[jira] [Commented] (HIVE-3772) Fix a concurrency bug in LazyBinaryUtils due to a static field (patch by Reynold Xin)
    [ https://issues.apache.org/jira/browse/HIVE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510333#comment-13510333 ]

Reynold Xin commented on HIVE-3772:
-----------------------------------

Thanks for submitting this, Mikhail. Note that this was introduced in 0.9. In 0.7, this was not a problem ...

> Fix a concurrency bug in LazyBinaryUtils due to a static field (patch by
> Reynold Xin)
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3772
>                 URL: https://issues.apache.org/jira/browse/HIVE-3772
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Mikhail Bautin
>
> Creating a JIRA for [~rxin]'s patch needed by the Shark project.
> https://github.com/amplab/hive/commit/17e1c3dd2f6d8eca767115dc46d5a880aed8c765
> writeVLong should not use a static field due to concurrency concerns.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
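To make the concurrency concern concrete, here is a hedged sketch of the pattern behind the fix: a varint writer that keeps its scratch buffer local to the call instead of in a shared static field. The encoding and names below are illustrative and are not the actual LazyBinaryUtils code.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative (not Hive's actual) varint writer. The point is the buffer:
// a static byte[] shared by all threads lets concurrent writers corrupt each
// other's output, while a per-call (or per-thread) buffer has no shared state.
public class VLongWriter {

    // Unsafe pattern this sketch avoids:
    // private static final byte[] SHARED_SCRATCH = new byte[10];

    public static void writeVLong(OutputStream out, long value) throws IOException {
        byte[] scratch = new byte[10];     // local scratch buffer, no race possible
        int len = 0;
        do {                               // low 7 bits per byte, high bit = continuation
            byte b = (byte) (value & 0x7f);
            value >>>= 7;
            if (value != 0) {
                b |= (byte) 0x80;
            }
            scratch[len++] = b;
        } while (value != 0);
        out.write(scratch, 0, len);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeVLong(bos, 300L);
        System.out.println("encoded 300 in " + bos.size() + " bytes"); // 2 bytes
    }
}
{code}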