[jira] [Created] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2017-04-05 Thread Reynold Xin (JIRA)
Reynold Xin created HIVE-16391:
--

 Summary: Publish proper Hive 1.2 jars (without including all 
dependencies in uber jar)
 Key: HIVE-16391
 URL: https://issues.apache.org/jira/browse/HIVE-16391
 Project: Hive
  Issue Type: Task
  Components: Build Infrastructure
Reporter: Reynold Xin


Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
only change in the fork is to work around the issue that Hive publishes only 
two sets of jars: one set with no dependencies declared, and another with all 
of the dependencies bundled into the published uber jar.

There is general consensus on both sides that we should remove the forked Hive.

The change in the forked version is recorded here: 
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2


Note that the fork previously included other fixes, but those have all become 
unnecessary.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-9362) Document API Gurantees

2015-02-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308023#comment-14308023
 ] 

Reynold Xin commented on HIVE-9362:
---

It's great to see this ticket! It is an important step towards Hive being a 
platform and would be tremendously useful to Spark.

> Document API Gurantees
> --
>
> Key: HIVE-9362
> URL: https://issues.apache.org/jira/browse/HIVE-9362
> Project: Hive
>  Issue Type: Task
>Reporter: Brock Noland
>Priority: Blocker
> Fix For: 0.15.0
>
>
> This is an uber JIRA to document our API compatibility guarantees. Similar to 
> Hadoop, I believe we should have 
> [InterfaceAudience|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-annotations/src/main/java/org/apache/hadoop/classification/InterfaceAudience.java]
>  and 
> [InterfaceStability|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-annotations/src/main/java/org/apache/hadoop/classification/InterfaceStability.java]
>  annotations, which I believe originally came from Sun.
> This project would be an effort by the Hive community, including other 
> projects which depend on Hive APIs, to document which APIs they use. 
> Although not all of the APIs they use may be considered {{Stable}} or even 
> {{Evolving}}, we'll at least have an idea of whom we are breaking when a 
> change is made.
> Beyond the Java API there is the Thrift API. Many projects use the Thrift 
> bindings directly since we don't provide an API in, say, Python. As such, I'd 
> suggest we consider the Thrift API to be {{Public}} and {{Stable}}.
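
For illustration, a minimal sketch of what Hadoop-style audience annotations 
could look like if Hive adopted them; the package name and the {{SerDe}} usage 
below are hypothetical, modeled on the Hadoop classes linked above:

{code}
// Hypothetical Hive equivalent of Hadoop's InterfaceAudience annotations.
// Package and class placement are assumptions for this sketch.
package org.apache.hive.common.classification;

import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class InterfaceAudience {

  /** Intended for use by any project or application. */
  @Documented @Retention(RetentionPolicy.RUNTIME)
  public @interface Public {}

  /** Intended only for the named projects (e.g. {"Spark"}). */
  @Documented @Retention(RetentionPolicy.RUNTIME)
  public @interface LimitedPrivate { String[] value(); }

  /** Intended for use only within Hive itself. */
  @Documented @Retention(RetentionPolicy.RUNTIME)
  public @interface Private {}

  private InterfaceAudience() {}
}

// Usage on an API surface that downstream projects rely on (an
// InterfaceStability annotation class would be defined the same way):
//
//   @InterfaceAudience.Public
//   @InterfaceStability.Stable
//   public interface SerDe { ... }
{code}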



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9410) ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch]

2015-01-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated HIVE-9410:
--
Description: 
We have a Hive query case with a UDF defined (e.g. BigBench cases Q10, Q18, 
etc.). It passes in default Hive (on MR) mode but fails in Hive on Spark mode 
(both Standalone and Yarn-Client). 

Although we use 'add jar .jar;' to add the UDF jar explicitly, the issue 
still exists. 

BTW, if we put the UDF jar into the $HIVE_HOME/lib dir, the case passes.

The detailed error message is below (NOTE: 
de.bankmark.bigbench.queries.q10.SentimentUDF is the UDF contained in the jar 
bigbenchqueriesmr.jar, and we have explicitly added a command like 'add jar 
/location/to/bigbenchqueriesmr.jar;' to the .sql file):

{code}
INFO  [pool-1-thread-1]: client.RemoteDriver (RemoteDriver.java:call(316)) - 
Failed to run job 8dd120cb-1a4d-4d1c-ba31-61eac648c27d
org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: 
de.bankmark.bigbench.queries.q10.SentimentUDF
Serialization trace:
genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc)
conf (org.apache.hadoop.hive.ql.exec.UDTFOperator)
childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.MapJoinOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
right (org.apache.commons.lang3.tuple.ImmutablePair)
edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork)
at 
org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
at 
org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
...
Caused by: java.lang.ClassNotFoundException: 
de.bankmark.bigbench.queries.q10.SentimentUDF
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
... 55 more
{code}
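
The trace suggests a classloader visibility problem: the jar added via 'add 
jar' is visible to the session's classloader, but the Kryo class resolver on 
the remote Spark driver resolves names through a loader that never saw that 
jar, which would explain why only jars on the base classpath (e.g. 
$HIVE_HOME/lib) work. A minimal standalone sketch of that distinction, reusing 
the class and jar names from the report as placeholders (this is not Hive 
code):

{code}
import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderSketch {
  public static void main(String[] args) throws Exception {
    URL udfJar = new URL("file:///location/to/bigbenchqueriesmr.jar");
    ClassLoader base = ClassLoaderSketch.class.getClassLoader();
    ClassLoader withUdf = new URLClassLoader(new URL[] { udfJar }, base);

    // Succeeds: this loader knows about the runtime-added jar.
    Class.forName("de.bankmark.bigbench.queries.q10.SentimentUDF", true, withUdf);

    // Throws ClassNotFoundException: the base loader never saw the jar.
    // This mirrors the remote driver deserializing the plan without the
    // UDF jar on its classpath.
    Class.forName("de.bankmark.bigbench.queries.q10.SentimentUDF", true, base);
  }
}
{code}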


[jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]

2014-11-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209327#comment-14209327
 ] 

Reynold Xin commented on HIVE-7333:
---

Don't think any changes are necessary in Spark. At the end of the day you can 
run arbitrary code on arbitrary records for each partition - using that alone 
should be sufficient to run vectorization. 

You can even put an entire partition of records into one iterator output ...
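
A minimal sketch of that approach on the Spark Java API (assuming the Spark 1.x 
signature where {{FlatMapFunction.call}} returns an {{Iterable}}; later versions 
return an {{Iterator}}), batching each partition into fixed-size arrays so 
downstream code can process rows vectorized, with no Spark-side changes:

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

public class BatchedPartitions {
  static final int BATCH_SIZE = 1024;

  // Turn an RDD of single rows (ints here, as a stand-in) into an RDD of batches.
  static JavaRDD<int[]> toBatches(JavaRDD<Integer> rows) {
    return rows.mapPartitions(new FlatMapFunction<Iterator<Integer>, int[]>() {
      @Override
      public Iterable<int[]> call(Iterator<Integer> it) {
        List<int[]> batches = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>(BATCH_SIZE);
        while (it.hasNext()) {
          buffer.add(it.next());
          if (buffer.size() == BATCH_SIZE || !it.hasNext()) {
            int[] batch = new int[buffer.size()];
            for (int i = 0; i < batch.length; i++) {
              batch[i] = buffer.get(i);
            }
            batches.add(batch);  // each output element is a whole batch of rows
            buffer.clear();
          }
        }
        return batches;
      }
    });
  }
}
{code}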


> Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
> -
>
> Key: HIVE-7333
> URL: https://issues.apache.org/jira/browse/HIVE-7333
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
>
> Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]

2014-11-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209315#comment-14209315
 ] 

Reynold Xin commented on HIVE-7333:
---

This is pretty trivial to solve. Each "row" in an RDD can be a batch of rows.
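
At the type level this is just a choice of element type; nothing stops an RDD 
element from being a whole batch. As a small sketch (Spark Java API; {{String}} 
stands in for a row type here), {{glom()}} is one built-in way to get one batch 
per partition:

{code}
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

public class RowBatches {
  // JavaRDD<String>       -> one element per row
  // JavaRDD<List<String>> -> one element per batch of rows
  static JavaRDD<List<String>> asBatches(JavaRDD<String> rows) {
    return rows.glom();  // each partition collapses into a single List, i.e. one batch
  }
}
{code}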


> Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
> -
>
> Key: HIVE-7333
> URL: https://issues.apache.org/jira/browse/HIVE-7333
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
>
> Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078939#comment-14078939
 ] 

Reynold Xin commented on HIVE-7334:
---

BTW definitely look at https://github.com/apache/spark/pull/1499

> Create SparkShuffler, shuffling data between map-side data processing and 
> reduce-side processing
> 
>
> Key: HIVE-7334
> URL: https://issues.apache.org/jira/browse/HIVE-7334
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Xuefu Zhang
>Assignee: Rui Li
> Attachments: HIVE-7334.patch
>
>
> Please refer to the design spec.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7387) Guava version conflict between hadoop and spark [Spark-Branch]

2014-07-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated HIVE-7387:
--

Description: 
hadoop-hdfs and hadoop-common depend on guava-11.0.2.jar, while Spark depends 
on guava-14.0.1.jar. guava-11.0.2 has API conflicts with guava-14.0.1, and 
because the Hive CLI currently loads both dependencies onto the classpath, 
queries fail on either the Spark engine or the MR engine.

{code}
java.lang.NoSuchMethodError: 
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at 
org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
at 
org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
at 
org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
at 
org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
at 
org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
at 
org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
at 
org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
at 
org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
at 
org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
at 
org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:527)
at 
org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:307)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkClient.createRDD(SparkClient.java:204)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:167)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:32)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:159)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
{code}

NO PRECOMMIT TESTS. This is for spark branch only.
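
A minimal sketch of the binary incompatibility behind the trace: 
{{HashFunction.hashInt(int)}} exists in guava-14.0.1 (it was added around Guava 
12) but not in guava-11.0.2, so code compiled against the newer Guava, like the 
{{OpenHashSet}} call above, fails at runtime when the older jar wins on the 
classpath:

{code}
import com.google.common.hash.Hashing;

public class GuavaConflictSketch {
  public static void main(String[] args) {
    // Compiles against guava-14.0.1. If guava-11.0.2 is picked up first at
    // runtime, this line throws:
    //   java.lang.NoSuchMethodError:
    //   com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    int hash = Hashing.murmur3_32().hashInt(42).asInt();
    System.out.println(hash);
  }
}
{code}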


[jira] [Commented] (HIVE-3772) Fix a concurrency bug in LazyBinaryUtils due to a static field (patch by Reynold Xin)

2012-12-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510333#comment-13510333
 ] 

Reynold Xin commented on HIVE-3772:
---

Thanks for submitting this, Mikhail. Note that this was introduced in 0.9. In 
0.7, this was not a problem ...

> Fix a concurrency bug in LazyBinaryUtils due to a static field (patch by 
> Reynold Xin)
> -
>
> Key: HIVE-3772
> URL: https://issues.apache.org/jira/browse/HIVE-3772
> Project: Hive
>  Issue Type: Bug
>Reporter: Mikhail Bautin
>
> Creating a JIRA for [~rxin]'s patch needed by the Shark project. 
> https://github.com/amplab/hive/commit/17e1c3dd2f6d8eca767115dc46d5a880aed8c765
> writeVLong should not use a static field due to concurrency concerns.
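
A simplified sketch of the bug pattern (not the actual Hive encoding or the 
actual patch): a static scratch buffer shared by every thread makes writeVLong 
racy, since concurrent callers interleave writes into the same array; a local 
or ThreadLocal buffer removes the shared mutable state.

{code}
import java.io.ByteArrayOutputStream;

public class VLongWriterSketch {

  // Racy: every thread shares this scratch buffer.
  private static final byte[] SHARED_SCRATCH = new byte[10];

  static void writeVLongRacy(ByteArrayOutputStream out, long v) {
    int len = encode(SHARED_SCRATCH, v);   // another thread may overwrite it here
    out.write(SHARED_SCRATCH, 0, len);
  }

  // Thread-safe: each call uses its own buffer (a ThreadLocal buffer would
  // avoid the per-call allocation).
  static void writeVLongSafe(ByteArrayOutputStream out, long v) {
    byte[] scratch = new byte[10];
    int len = encode(scratch, v);
    out.write(scratch, 0, len);
  }

  // Toy base-128 varint encoding, only here to make the sketch self-contained;
  // a long needs at most 10 bytes in this scheme.
  private static int encode(byte[] buf, long v) {
    int i = 0;
    do {
      byte b = (byte) (v & 0x7f);
      v >>>= 7;
      buf[i++] = (byte) (v != 0 ? (b | 0x80) : b);
    } while (v != 0);
    return i;
  }
}
{code}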

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira