[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567279#comment-14567279 ] Jianshi Huang commented on SPARK-4782: -- Thanks Luca for the clever fix! I also noticed that the schema inference in JsonRDD is too JSON-specific, as JSON's data types are quite limited. Jianshi Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to a JSON String first and using JsonRDD.inferSchema on it, which is very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for schemaless data, as adding a Map-like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new serialization format we want to support, we just need to add a Map interface wrapper to it.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
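For context, a minimal sketch of the JSON round-trip workaround described above, assuming a Spark 1.x shell where sc and sqlContext are in scope and json4s (bundled with Spark) is on the classpath; the sample records are made up:
{code}
// Sketch of the current workaround: serialize each Map to JSON only so that
// JsonRDD can infer the schema. The double conversion is what makes it inefficient.
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

val maps = sc.parallelize(Seq(
  Map("id" -> 1, "name" -> "a"),
  Map("id" -> 2, "name" -> "b", "extra" -> true)))   // schemaless records

val json = maps.map { m =>
  implicit val formats = DefaultFormats
  Serialization.write(m)                             // Map -> JSON string
}

val df = sqlContext.jsonRDD(json)                    // schema inferred from the JSON strings
df.printSchema()
{code}
A direct inferSchema over RDD[Map[String, Any]] would skip the serialize/parse round trip entirely.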
[jira] [Created] (SPARK-8012) ArrayIndexOutOfBoundsException in SerializationDebugger
Jianshi Huang created SPARK-8012: Summary: ArrayIndexOutOfBoundsException in SerializationDebugger Key: SPARK-8012 URL: https://issues.apache.org/jira/browse/SPARK-8012 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Jianshi Huang It makes NonSerializable exception less obvious. {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107) at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682) at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:40) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:159) at org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:131) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.sources.DataSourceStrategy$.buildPartitionedTableScan(DataSourceStrategy.scala:131) at org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:80) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:109) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at
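If the goal is just to surface the underlying NotSerializableException while this is investigated, one possible workaround (a sketch, assuming the spark.serializer.extraDebugInfo flag in 1.4 gates the SerializationDebugger as documented) is to disable the extra debug info:
{code}
// Hypothetical workaround sketch: with spark.serializer.extraDebugInfo=false the
// JavaSerializer should rethrow the original NotSerializableException instead of
// failing inside SerializationDebugger.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("serialization-debug-workaround")
  .set("spark.serializer.extraDebugInfo", "false")
val sc = new SparkContext(conf)
{code}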
[jira] [Created] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder
Jianshi Huang created SPARK-8014: Summary: DataFrame.write.mode(error).save(...) should not scan the output folder Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Priority: Minor I had code that set the wrong output location, and it failed with strange errors: it scanned my ~/.Trash folder... It turns out save scans the output folder first, before mode(error) does its existence check. Scanning is unnecessary for mode = error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
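A minimal reproduction sketch of the reported behavior (the output path is made up; assumes a Spark 1.4 shell with sc and sqlContext):
{code}
// With SaveMode.ErrorIfExists (i.e. mode("error")), only an existence check on
// the target path should be needed; scanning its contents first is wasted work
// and produces confusing errors when the location is wrong.
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._

val df = sc.parallelize(1 to 10).map(i => (i, i.toString)).toDF("id", "name")
df.write
  .mode(SaveMode.ErrorIfExists)        // same as .mode("error")
  .parquet("/some/wrong/output/path")  // hypothetical path
{code}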
[jira] [Commented] (SPARK-8012) ArrayIndexOutOfBoundsException in SerializationDebugger
[ https://issues.apache.org/jira/browse/SPARK-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568350#comment-14568350 ] Jianshi Huang commented on SPARK-8012: -- Yeah, it's from pretty big code base. I'm trying to reduce the scope. BTW, I'm using 1.4.0-rc3. Jianshi ArrayIndexOutOfBoundsException in SerializationDebugger --- Key: SPARK-8012 URL: https://issues.apache.org/jira/browse/SPARK-8012 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Jianshi Huang It makes NonSerializable exception less obvious. {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107) at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682) at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:40) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:159) at org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:131) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.sources.DataSourceStrategy$.buildPartitionedTableScan(DataSourceStrategy.scala:131) at org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:80) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at
[jira] [Commented] (SPARK-6297) EventLog permissions are always set to 770 which causes problems
[ https://issues.apache.org/jira/browse/SPARK-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566907#comment-14566907 ] Jianshi Huang commented on SPARK-6297: -- What about YARN? I have trouble sharing event logs among our team, since Spark processes on YARN run under each user's account. Jianshi EventLog permissions are always set to 770 which causes problems Key: SPARK-6297 URL: https://issues.apache.org/jira/browse/SPARK-6297 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: All, tested in lxc running different users without a common group Reporter: Lukas Stefaniak Priority: Trivial Labels: newbie In EventLoggingListener, event log file permissions are always set explicitly to 770, and there is no way to override this. The problem appears as an exception being thrown when the driver process and the Spark master don't share the same user or group. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7939) Make URL partition recognition return String by default for all partition column types and values
[ https://issues.apache.org/jira/browse/SPARK-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564412#comment-14564412 ] Jianshi Huang commented on SPARK-7939: -- That would be nice; please also consider disabling type inference as the default behavior. Jianshi Make URL partition recognition return String by default for all partition column types and values - Key: SPARK-7939 URL: https://issues.apache.org/jira/browse/SPARK-7939 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Labels: 1.4.1 Imagine the following HDFS paths: /data/split=00 /data/split=01 ... /data/split=FF If I have 10 or fewer partitions (00, 01, ... 09), partition recognition currently treats column 'split' as an integer column. If I have more than 10 partitions, column 'split' is recognized as a String... This is very confusing. *So I'm suggesting treating partition columns as String by default*, and allowing the user to specify types if needed. Another example is date: /data/date=2015-04-01 => 'date' is String /data/date=20150401 => 'date' is Int Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
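Until the default changes, a possible workaround sketch (column names follow the /data/split example above; nothing here is part of the proposal) is to expose a string-typed copy of the inferred partition column after loading:
{code}
// Workaround sketch: add a StringType view of the partition column. Note that
// once values like "00".."09" have been inferred as integers the leading zeros
// are already lost, which is exactly why string-by-default (or an explicit,
// user-specified type) would be the cleaner fix.
import org.apache.spark.sql.types.StringType

val df = sqlContext.read.parquet("/data")                        // 'split' type is inferred
val fixed = df.withColumn("split_str", df("split").cast(StringType))
fixed.printSchema()
{code}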
[jira] [Updated] (SPARK-7939) Make URL partition recognition return String by default for all partition column types and values
[ https://issues.apache.org/jira/browse/SPARK-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-7939: - Summary: Make URL partition recognition return String by default for all partition column types and values (was: Make URL partition recognition return String by default for all partition column values) Make URL partition recognition return String by default for all partition column types and values - Key: SPARK-7939 URL: https://issues.apache.org/jira/browse/SPARK-7939 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Imagine the following HDFS paths: /data/split=00 /data/split=01 ... /data/split=FF If I have less than or equal to 10 partitions (00, 01, ... 09), currently partition recognition will treat column 'split' as integer column. If I have more than 10 partitions, column 'split' will be recognized as String... This is very confusing. *So I'm suggesting to treat partition columns as String by default*, and allow user to specify types if needed. Another example is date: /data/date=2015-04-01 = 'date' is String /data/date=20150401 = 'date' is Int Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)
Jianshi Huang created SPARK-7937: Summary: Cannot compare Hive named_struct. (when using argmax, argmin) Key: SPARK-7937 URL: https://issues.apache.org/jira/browse/SPARK-7937 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Imagine the following SQL: Intention: get last used bank account country. select bank_account_id, max(named_struct( 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 'bank_country', bank_country)).bank_country from bank_account_monthly where year_month='201502' group by bank_account_id = Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type StructType(StructField(src_row_update_ts,LongType,true), StructField(bank_country,StringType,true)) does not support ordered operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215) at org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235) at org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
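As a workaround until struct ordering is supported, the same argmax can be expressed with a window function instead of max() over a named_struct. This is only a sketch: table and column names follow the SQL above, a HiveContext is assumed so that unix_timestamp and SQL window functions (added in 1.4) are available, and the timestamp format argument from the original query is omitted.
{code}
// Workaround sketch: rank rows per account by update time and keep the newest
// one, so no ordered comparison over a StructType is needed.
val lastCountry = sqlContext.sql("""
  select bank_account_id, bank_country
  from (
    select bank_account_id, bank_country,
           row_number() over (partition by bank_account_id
                              order by unix_timestamp(src_row_update_ts) desc) as rn
    from bank_account_monthly
    where year_month = '201502'
  ) t
  where rn = 1
""")
{code}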
[jira] [Updated] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)
[ https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-7937: - Description: Imagine the following SQL: Intention: get last used bank account country. {code:sql} select bank_account_id, max(named_struct( 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 'bank_country', bank_country)).bank_country from bank_account_monthly where year_month='201502' group by bank_account_id {code} = {noformat} Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type StructType(StructField(src_row_update_ts,LongType,true), StructField(bank_country,StringType,true)) does not support ordered operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215) at org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235) at org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) {noformat} was: Imagine the following SQL: Intention: get last used bank account country. 
``` sql select bank_account_id, max(named_struct( 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 'bank_country', bank_country)).bank_country from bank_account_monthly where year_month='201502' group by bank_account_id ``` = ``` Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type StructType(StructField(src_row_update_ts,LongType,true), StructField(bank_country,StringType,true)) does not support ordered operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215) at org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235) at org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at
[jira] [Updated] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)
[ https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-7937: - Description: Imagine the following SQL: Intention: get last used bank account country. ``` sql select bank_account_id, max(named_struct( 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 'bank_country', bank_country)).bank_country from bank_account_monthly where year_month='201502' group by bank_account_id ``` = ``` Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type StructType(StructField(src_row_update_ts,LongType,true), StructField(bank_country,StringType,true)) does not support ordered operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215) at org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235) at org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) ``` was: Imagine the following SQL: Intention: get last used bank account country. 
select bank_account_id, max(named_struct( 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 'bank_country', bank_country)).bank_country from bank_account_monthly where year_month='201502' group by bank_account_id = Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type StructType(StructField(src_row_update_ts,LongType,true), StructField(bank_country,StringType,true)) does not support ordered operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215) at org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235) at org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at
[jira] [Created] (SPARK-7939) Make URL partition recognition return String by default for all partition column values
Jianshi Huang created SPARK-7939: Summary: Make URL partition recognition return String by default for all partition column values Key: SPARK-7939 URL: https://issues.apache.org/jira/browse/SPARK-7939 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Imagine the following HDFS paths: /data/split=00 /data/split=01 ... /data/split=FF If I have less than or equal to 10 partitions (00, 01, ... 09), currently partition recognition will treat column 'split' as integer column. If I have more than 10 partitions, column 'split' will be recognized as String... This is very confusing. *So I'm suggesting to treat partition columns as String by default*, and allow user to specify types if needed. Another example is date: /data/date=2015-04-01 = 'date' is String /data/date=20150401 = 'date' is Int Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)
[ https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564267#comment-14564267 ] Jianshi Huang commented on SPARK-7937: -- Blog for describing Hive's argmax, argmin feature: https://www.joefkelley.com/?p=727 HIVE JIRA: https://issues.apache.org/jira/browse/HIVE-1128 Jianshi Cannot compare Hive named_struct. (when using argmax, argmin) - Key: SPARK-7937 URL: https://issues.apache.org/jira/browse/SPARK-7937 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Imagine the following SQL: Intention: get last used bank account country. {code:sql} select bank_account_id, max(named_struct( 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 'bank_country', bank_country)).bank_country from bank_account_monthly where year_month='201502' group by bank_account_id {code} = {noformat} Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type StructType(StructField(src_row_update_ts,LongType,true), StructField(bank_country,StringType,true)) does not support ordered operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222) at org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215) at org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235) at org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource
[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550028#comment-14550028 ] Jianshi Huang commented on SPARK-6533: -- Ah, right. Tested in 1.4.0, sqlc.load works! Thanks Baishuo. Jianshi Allow using wildcard and other file pattern in Parquet DataSource - Key: SPARK-6533 URL: https://issues.apache.org/jira/browse/SPARK-6533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Priority: Critical By default, spark.sql.parquet.useDataSourceApi is set to true. And loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) {noformat} And *\[abc\]* {noformat} val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0[12]) java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI.create(URI.java:859) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at 
org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) ... 49 elided Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.init(URI.java:595) at java.net.URI.create(URI.java:857) {noformat} If spark.sql.parquet.useDataSourceApi is not enabled we cannot have partition discovery, schema evolution etc, but being able to specify file pattern is also very important to applications. Please add this important feature. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource
[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang resolved SPARK-6533. -- Resolution: Won't Fix Don't use sqlc.parquetFile(...), use sqlc.load(..., parquet) instead, or the latest Reader/Writer API. Jianshi Allow using wildcard and other file pattern in Parquet DataSource - Key: SPARK-6533 URL: https://issues.apache.org/jira/browse/SPARK-6533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Priority: Critical By default, spark.sql.parquet.useDataSourceApi is set to true. And loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) {noformat} And *\[abc\]* {noformat} val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0[12]) java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI.create(URI.java:859) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at 
org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) ... 49 elided Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.init(URI.java:595) at java.net.URI.create(URI.java:857) {noformat} If spark.sql.parquet.useDataSourceApi is not enabled we cannot have partition discovery, schema evolution etc, but being able to specify file pattern is also very important to applications. Please add this important feature. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
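For anyone landing here, a quick sketch of the suggested alternatives (paths abbreviated exactly as in the report):
{code}
// The data source load path accepts glob patterns, unlike sqlContext.parquetFile.
val viaLoad = sqlContext.load("hdfs://.../source=live/date=2014-06-0*", "parquet")

// Equivalent with the 1.4 reader API:
val viaReader = sqlContext.read.format("parquet")
  .load("hdfs://.../source=live/date=2014-06-0*")
{code}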
[jira] [Commented] (SPARK-7614) CLONE - Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542596#comment-14542596 ] Jianshi Huang commented on SPARK-7614: -- Yeah, it seems I cannot reopen 7399. That's why I cloned it. Jianshi CLONE - Master fails on 2.11 with compilation error --- Key: SPARK-7614 URL: https://issues.apache.org/jira/browse/SPARK-7614 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Jianshi Huang Assignee: Tijo Thomas The current code in master (and 1.4 branch) fails on 2.11 with the following compilation error: {code} [error] /home/ubuntu/workspace/Apache Spark (master) on 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in object RDDOperationScope, multiple overloaded alternatives of method withScope define default arguments. [error] private[spark] object RDDOperationScope { [error] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4356) Test Scala 2.11 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542603#comment-14542603 ] Jianshi Huang commented on SPARK-4356: -- When can we have 2.11 build tests in Jenkins? Jianshi Test Scala 2.11 on Jenkins -- Key: SPARK-4356 URL: https://issues.apache.org/jira/browse/SPARK-4356 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical We need to make some modifications to the test harness so that we can test Scala 2.11 in Maven regularly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3056) Sort-based Aggregation
[ https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542148#comment-14542148 ] Jianshi Huang commented on SPARK-3056: -- Will [SPARK-2926] alone be enough for this improvement? Sort-based Aggregation -- Key: SPARK-3056 URL: https://issues.apache.org/jira/browse/SPARK-3056 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Spark SQL only supports hash-based aggregation, which may cause OOM if there are too many identical keys in the input tuples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7614) CLONE - Master fails on 2.11 with compilation error
Jianshi Huang created SPARK-7614: Summary: CLONE - Master fails on 2.11 with compilation error Key: SPARK-7614 URL: https://issues.apache.org/jira/browse/SPARK-7614 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Jianshi Huang Assignee: Tijo Thomas Fix For: 1.4.0 The current code in master (and 1.4 branch) fails on 2.11 with the following compilation error: {code} [error] /home/ubuntu/workspace/Apache Spark (master) on 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in object RDDOperationScope, multiple overloaded alternatives of method withScope define default arguments. [error] private[spark] object RDDOperationScope { [error] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
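For readers unfamiliar with the error, here is a minimal, hypothetical illustration of the restriction the message refers to (not the actual RDDOperationScope code): two overloaded alternatives of one method may not both declare default arguments.
{code}
// Minimal illustration only: scalac rejects this with
// "multiple overloaded alternatives of method withScope define default arguments".
object Repro {
  def withScope(name: String = "scope"): Unit = ()
  def withScope(id: Int, name: String = "scope"): Unit = ()
}
{code}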
[jira] [Commented] (SPARK-7399) Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542589#comment-14542589 ] Jianshi Huang commented on SPARK-7399: -- Looks like another change has broken it again. Jianshi Master fails on 2.11 with compilation error --- Key: SPARK-7399 URL: https://issues.apache.org/jira/browse/SPARK-7399 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Iulian Dragos Assignee: Tijo Thomas Fix For: 1.4.0 The current code in master (and 1.4 branch) fails on 2.11 with the following compilation error: {code} [error] /home/ubuntu/workspace/Apache Spark (master) on 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in object RDDOperationScope, multiple overloaded alternatives of method withScope define default arguments. [error] private[spark] object RDDOperationScope { [error] ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539617#comment-14539617 ] Jianshi Huang commented on SPARK-6154: -- Thanks Aniket, If you can submit the pull request, I'm happy to try it. Jianshi Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534344#comment-14534344 ] Jianshi Huang commented on SPARK-6154: -- Do you mean we need to upgrade the jline version for both 2.11 and 2.10? Jianshi Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6561) Add partition support in saveAsParquet
Jianshi Huang created SPARK-6561: Summary: Add partition support in saveAsParquet Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquet(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6561: - Description: Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi was: Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquet(path: String, partitionColumns: Seq[String]) {code} Jianshi Add partition support in saveAsParquet -- Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
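For comparison, this is not the proposed API itself, only a sketch of how the same partitioned write is expressed with the DataFrameWriter introduced in Spark 1.4 (column names and path are illustrative):
{code}
// Sketch with the 1.4 writer API: partitionBy lays the files out as
// .../date=2015-04-01/... so the automatic partition discovery mentioned above
// can pick the column back up on read.
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, "2015-04-01"), (2, "2015-04-02"))).toDF("id", "date")
df.write
  .partitionBy("date")
  .parquet("hdfs://.../output")   // hypothetical path
{code}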
[jira] [Created] (SPARK-6533) Cannot use wildcard and other file pattern in sqlContext.parquetFile if spark.sql.parquet.useDataSourceApi is not set to false
Jianshi Huang created SPARK-6533: Summary: Cannot use wildcard and other file pattern in sqlContext.parquetFile if spark.sql.parquet.useDataSourceApi is not set to false Key: SPARK-6533 URL: https://issues.apache.org/jira/browse/SPARK-6533 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang If spark.sql.parquet.useDataSourceApi is not set to false, which is the default. Loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) {noformat} And *\[abc\]* {noformat} val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0[12]) java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI.create(URI.java:859) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) ... 
49 elided Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.init(URI.java:595) at java.net.URI.create(URI.java:857) {noformat} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource
[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6533: - Description: By default, spark.sql.parquet.useDataSourceApi is set to true. And loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) {noformat} And *\[abc\]* {noformat} val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0[12]) java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI.create(URI.java:859) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) ... 
49 elided Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.init(URI.java:595) at java.net.URI.create(URI.java:857) {noformat} If spark.sql.parquet.useDataSourceApi is not enabled we cannot have partition discovery, schema evolution etc, but being able to specify file pattern is also very important to applications. Please add this important feature. Jianshi was: If spark.sql.parquet.useDataSourceApi is not set to false, which is the default. Loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at
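A user-side workaround while this is open (a sketch, not the requested fix): expand the pattern with Hadoop's FileSystem.globStatus and hand the matched directories to parquetFile, so no wildcard or bracket ever reaches java.net.URI. The pattern below is an illustrative placeholder, not the elided cluster path from the report.

{code}
// Resolve the glob ourselves, then load the concrete directories (Spark 1.3 shell assumed).
import org.apache.hadoop.fs.{FileStatus, Path}

val pattern    = "hdfs:///data/source=live/date=2014-06-0[12]"   // illustrative path
val hadoopPath = new Path(pattern)
val fs         = hadoopPath.getFileSystem(sc.hadoopConfiguration)
val matched    = Option(fs.globStatus(hadoopPath)).getOrElse(Array.empty[FileStatus])
val qp         = sqlContext.parquetFile(matched.map(_.getPath.toString): _*)
{code}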
[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource
[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6533: - Description: If spark.sql.parquet.useDataSourceApi is not set to false, which is the default. Loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) {noformat} And *\[abc\]* {noformat} val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0[12]) java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI.create(URI.java:859) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) ... 
49 elided Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.init(URI.java:595) at java.net.URI.create(URI.java:857) {noformat} If spark.sql.parquet.useDataSourceApi is not enabled we cannot have partition discovery, schema evolution etc, but being able to specify file pattern is also very important to applications. Please add this important feature. Jianshi was: If spark.sql.parquet.useDataSourceApi is not set to false, which is the default. Loading parquet files using file pattern will throw errors. *\*Wildcard* {noformat} scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*) 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at
[jira] [Created] (SPARK-6432) Cannot load parquet data with partitions if not all partition columns match data columns
Jianshi Huang created SPARK-6432: Summary: Cannot load parquet data with partitions if not all partition columns match data columns Key: SPARK-6432 URL: https://issues.apache.org/jira/browse/SPARK-6432 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Suppose we have a dataset in the following folder structure: {noformat} parquet/source=live/date=2015-03-18/ parquet/source=live/date=2015-03-19/ ... {noformat} And the data schema has the following columns: - id - *event_date* - source - value Here the partition key source matches the data column source, but the partition key date doesn't match any column in the data. Then we cannot load the dataset in Spark using parquetFile. It reports: org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List()); ... Currently, if the partition columns overlap with the data columns, the partition columns have to be a subset of the data columns. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
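As a side note, one way to sidestep the ambiguity until this is fixed is to read a leaf partition directory directly, so no partition columns are inferred, and re-attach the partition values as literal columns under non-conflicting names. This is only a sketch against the layout above; the paths and the part_ prefix are illustrative, not part of the report.

{code}
// Reading the leaf directory means Spark discovers no partition columns, so the
// data column `source` no longer collides with the partition key of the same name.
import org.apache.spark.sql.functions.lit

val oneDay = sqlContext.parquetFile("parquet/source=live/date=2015-03-18")
val withParts = oneDay.withColumn("part_source", lit("live"))
                      .withColumn("part_date", lit("2015-03-18"))
{code}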
[jira] [Commented] (SPARK-6432) Cannot load parquet data with partitions if not all partition columns match data columns
[ https://issues.apache.org/jira/browse/SPARK-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371061#comment-14371061 ] Jianshi Huang commented on SPARK-6432: -- If no partition column appear in the data columns, then it's fine. Jianshi Cannot load parquet data with partitions if not all partition columns match data columns Key: SPARK-6432 URL: https://issues.apache.org/jira/browse/SPARK-6432 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Assignee: Cheng Lian Suppose we have a dataset in the following folder structure: {noformat} parquet/source=live/date=2015-03-18/ parquet/source=live/date=2015-03-19/ ... {noformat} And the data schema has the following columns: - id - *event_date* - source - value Where partition key source matches data column source, but partition key date doesn't match any columns in data. Then we cannot load dataset in Spark using parquetFile. It reports: {code} org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List()); ... {code} Currently if partition columns has overlaps with data columns, partition columns have to be a subset of the data columns. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6432) Cannot load parquet data with partitions if not all partition columns match data columns
[ https://issues.apache.org/jira/browse/SPARK-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6432: - Description: Suppose we have a dataset in the following folder structure: {noformat} parquet/source=live/date=2015-03-18/ parquet/source=live/date=2015-03-19/ ... {noformat} And the data schema has the following columns: - id - *event_date* - source - value Where partition key source matches data column source, but partition key date doesn't match any columns in data. Then we cannot load dataset in Spark using parquetFile. It reports: {code} org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List()); ... {code} Currently if partition columns has overlaps with data columns, partition columns have to be a subset of the data columns. Jianshi was: Suppose we have a dataset in the following folder structure: {noformat} parquet/source=live/date=2015-03-18/ parquet/source=live/date=2015-03-19/ ... {noformat} And the data schema has the following columns: - id - *event_date* - source - value Where partition key source matches data column source, but partition key date doesn't match any columns in data. Then we cannot load dataset in Spark using parquetFile. It reports: org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List()); ... Currently if partition columns has overlaps with data columns, partition columns have to be a subset of the data columns. Jianshi Cannot load parquet data with partitions if not all partition columns match data columns Key: SPARK-6432 URL: https://issues.apache.org/jira/browse/SPARK-6432 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Suppose we have a dataset in the following folder structure: {noformat} parquet/source=live/date=2015-03-18/ parquet/source=live/date=2015-03-19/ ... {noformat} And the data schema has the following columns: - id - *event_date* - source - value Where partition key source matches data column source, but partition key date doesn't match any columns in data. Then we cannot load dataset in Spark using parquetFile. It reports: {code} org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List()); ... {code} Currently if partition columns has overlaps with data columns, partition columns have to be a subset of the data columns. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope
[ https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6382: - Description: Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi was: Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi withUDF(...) {...} for supporting temporary UDF definitions in the scope Key: SPARK-6382 URL: https://issues.apache.org/jira/browse/SPARK-6382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope
[ https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6382: - Description: Currently the scope of UDF registration is global. It's unsuitable for libraries that are built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi was: Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi withUDF(...) {...} for supporting temporary UDF definitions in the scope Key: SPARK-6382 URL: https://issues.apache.org/jira/browse/SPARK-6382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Currently the scope of UDF registration is global. It's unsuitable for libraries that are built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope
Jianshi Huang created SPARK-6382: Summary: withUDF(...) {...} for supporting temporary UDF definitions in the scope Key: SPARK-6382 URL: https://issues.apache.org/jira/browse/SPARK-6382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Currently the scope of UDF registration is global. It's unsuitable for libraries that are built on top of DataFrame, as many operations have to be done by registering a UDF first. Please provide a way for binding temporary UDFs, e.g. {code} withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2), ...) { sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id") } {code} Also, the UDF registry is a mutable HashMap; refactoring it to an immutable one makes more sense. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
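Pending a built-in mechanism, a rough user-side approximation can be written on top of the existing registration API. The helper below is hypothetical (withTempUDF and the generated name are not Spark APIs), and since Spark 1.3 has no public way to unregister a UDF it only hides the global entry behind a unique name rather than truly scoping it.

{code}
// Hypothetical sketch: register under a unique name, pass the name to the body,
// and let callers treat the UDF as if it were scoped to that block.
import java.util.UUID
import org.apache.spark.sql.SQLContext

def withTempUDF(sqlContext: SQLContext)
               (f: (Map[String, Double], Map[String, Double]) => Map[String, Double])
               (body: String => Unit): Unit = {
  val name = "merge_map_" + UUID.randomUUID().toString.replace("-", "")
  sqlContext.udf.register(name, f)   // still a global registration underneath
  body(name)
}

// Usage, assuming d1 and d2 are registered as in the example above:
withTempUDF(sqlContext) { (m1, m2) => m1 ++ m2 } { merge =>
  sqlContext.sql(s"select $merge(d1.map, d2.map) from d1, d2 where d1.id = d2.id").collect()
}
{code}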
[jira] [Commented] (SPARK-6363) make scala 2.11 default language
[ https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365726#comment-14365726 ] Jianshi Huang commented on SPARK-6363: -- My two cents. The only modules that are not working for 2.11 are thriftserver and kafka-streaming: Kafka only started to support 2.11 from 0.8.2, and Spark's Kafka module depends on 0.8.0. It should just be a matter of bumping the Kafka version to make it ready for 2.11. AFAIK, the 0.8.2 client works well with earlier 0.8.x servers. And we need to fix the thriftserver build issue. Jianshi make scala 2.11 default language Key: SPARK-6363 URL: https://issues.apache.org/jira/browse/SPARK-6363 Project: Spark Issue Type: Improvement Components: Build Reporter: antonkulaga Priority: Minor Labels: scala Most libraries have already moved to 2.11 and many are starting to drop 2.10 support. So it would be better if Spark binaries were built with Scala 2.11 by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6195) Specialized in-memory column type for decimal
[ https://issues.apache.org/jira/browse/SPARK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365549#comment-14365549 ] Jianshi Huang commented on SPARK-6195: -- Like this optimization! :) Jianshi Specialized in-memory column type for decimal - Key: SPARK-6195 URL: https://issues.apache.org/jira/browse/SPARK-6195 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.3.0, 1.2.1 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.4.0 When building in-memory columnar representation, decimal values are currently serialized via a generic serializer, which is unnecessarily slow. Since decimals are actually annotated long values, we should add a specialized decimal column type to speed up in-memory decimal serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356478#comment-14356478 ] Jianshi Huang commented on SPARK-6277: -- I see. Not tying this to Hadoop config is fine; how about env variables? It's quite common that I want to change one setting for particular tasks, and editing spark-defaults.conf every time is inconvenient. Env variables are a good fit here because of their dynamic scope. Typesafe's config has a similar feature for env variables, and it can even let them override previous settings. https://github.com/typesafehub/config#optional-system-or-env-variable-overrides Jianshi Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf - Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.0, 1.2.1 Reporter: Jianshi Huang I need to set spark.local.dir to use user local home instead of /tmp, but currently spark-defaults.conf can only allow constant values. What I want to do is to write: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
Jianshi Huang created SPARK-6277: Summary: Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.2.1, 1.3.0 Reporter: Jianshi Huang I need to set spark.local.dir to use user local home instead of /tmp, but currently spark-defaults.conf can only allow constant values. What I want to do is to write: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
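A user-side approximation, for what it's worth: if the application (rather than spark-submit) builds the SparkConf, ${VAR}-style references can be expanded against environment variables and system properties before the value is set. The expand helper and the property value below are illustrative only; this does not make spark-defaults.conf itself dynamic.

{code}
// Expand ${NAME} against env vars first, then JVM system properties; leave
// unknown references untouched. Illustrative helper, not a Spark API.
import scala.util.matching.Regex
import org.apache.spark.SparkConf

val VarRef = """\$\{([A-Za-z0-9_.]+)\}""".r

def expand(value: String): String =
  VarRef.replaceAllIn(value, m => Regex.quoteReplacement(
    sys.env.getOrElse(m.group(1), sys.props.getOrElse(m.group(1), m.matched))))

val conf = new SparkConf().set("spark.local.dir", expand("/home/${USER}/spark/tmp"))
{code}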
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:40 PM: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then pls think about adding an switch, say bq. spark.sql.strict_mode = true(default) / false Jianshi was (Author: huangjs): Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then pls think about adding an switch, say ``` spark.sql.strict_mode = true(default) / false ``` Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang commented on SPARK-6201: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then pls think about adding an switch, say spark.sql.strict_mode = true(default) / false Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
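In the meantime, a stopgap is to make the two sides of the IN the same type by hand, either by quoting the literals or by casting the column. The statements below are a sketch against the temp table d registered in the description; the results are assumed, not re-verified here.

{code}
// Keep both sides of the IN the same type so INSET never needs to coerce.
sqlContext.sql("select * from d where d.a in ('1', '2')").collect()          // string literals
sqlContext.sql("select * from d where cast(d.a as int) in (1, 2)").collect() // or cast the column
{code}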
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:39 PM: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then pls think about adding an switch, say ``` spark.sql.strict_mode = true(default) / false ``` Jianshi was (Author: huangjs): Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then pls think about adding an switch, say spark.sql.strict_mode = true(default) / false Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352417#comment-14352417 ] Jianshi Huang commented on SPARK-6154: -- I see. Here's my build flag: -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Phadoop-2.4 -Djava.version=1.7 -DskipTests BTW, when will Kafka and JDBC be supported in 2.11 build? Jianshi Build error with Scala 2.11 for v1.3.0-rc2 -- Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support latest Scala (2.11.6+)
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Summary: Support latest Scala (2.11.6+) (was: Support Scala 2.11.6+) Support latest Scala (2.11.6+) -- Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Issue Type: New Feature (was: Improvement) Support Scala 2.11.5+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Priority: Major (was: Minor) Support Scala 2.11.5+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6155) Support Scala 2.11.6+
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6155: - Summary: Support Scala 2.11.6+ (was: Support Scala 2.11.5+) Support Scala 2.11.6+ - Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6201) INSET should coerce types
Jianshi Huang created SPARK-6201: Summary: INSET should coerce types Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.2.0, 1.3.0 Reporter: Jianshi Huang Suppose we the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6201: - Description: Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does.* Jianshi was: Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6201: - Description: Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. Jianshi was: Suppose we the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6201: - Description: Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi was: Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does.* Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table: {code} sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: \3\}}))).registerTempTable(d) {code} The schema is {noformat} root |-- a: string (nullable = true) {noformat} Then, {code} sql(select * from d where (d.a = 1 or d.a = 2)).collect = Array([1], [2]) {code} where d.a and constants 1,2 will be casted to Double first and do the comparison as you can find it out in the plan: {noformat} Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType))) {noformat} However, if I use {code} sql(select * from d where d.a in (1,2)).collect {code} The result is empty. The physical plan shows it's using INSET: {noformat} == Physical Plan == Filter a#155 INSET (1,2) PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47 {noformat} *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, where Hive does. 
We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348361#comment-14348361 ] Jianshi Huang edited comment on SPARK-5763 at 3/5/15 8:21 AM: -- Upvote for this improvement. There's a similar ticket for groupByKey as well. https://issues.apache.org/jira/browse/SPARK-3461 Jianshi was (Author: huangjs): Upvote for this improvement. Jianshi Sort-based Groupby and Join to resolve skewed data -- Key: SPARK-5763 URL: https://issues.apache.org/jira/browse/SPARK-5763 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Lianhui Wang In SPARK-4644, it provides a way to resolve skewed data. But when we have more keys that are skewed, I think the approach in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed groupby and skewed join. Because SPARK-2926 implements merge-sort, we can implement sort-merge for skewed data based on SPARK-2926. I have implemented sort-merge-groupby and it works very well for skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins. [~rxin] [~sandyr] [~andrewor14] what are your opinions about this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348361#comment-14348361 ] Jianshi Huang commented on SPARK-5763: -- Upvote for this improvement. Jianshi Sort-based Groupby and Join to resolve skewed data -- Key: SPARK-5763 URL: https://issues.apache.org/jira/browse/SPARK-5763 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Lianhui Wang In SPARK-4644, it provides a way to resolve skewed data. But when we have more keys that are skewed, I think the approach in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed groupby and skewed join. Because SPARK-2926 implements merge-sort, we can implement sort-merge for skewed data based on SPARK-2926. I have implemented sort-merge-groupby and it works very well for skewed data in my tests. Later I will implement sort-merge-join to resolve skewed joins. [~rxin] [~sandyr] [~andrewor14] what are your opinions about this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
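Until a sort-based implementation lands, a common user-side mitigation for the groupBy side is to salt the hot keys so a single reducer does not receive the entire group, aggregate per salt, and then combine. The sketch below is generic and illustrative (toy data, an arbitrary salt count); it is not the approach proposed in this ticket.

{code}
// Two-phase aggregation with salted keys (Spark shell assumed).
import org.apache.spark.SparkContext._
import scala.util.Random

val pairs = sc.parallelize(Seq.fill(100000)(("hot", 1L)) ++ Seq(("cold", 1L)))

val numSalts = 16
val salted   = pairs.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
val partial  = salted.reduceByKey(_ + _)                             // per-salt partial sums
val totals   = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)
{code}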
[jira] [Commented] (SPARK-6155) Build with Scala 2.11.5 failed for Spark v1.3.0-rc2
[ https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347969#comment-14347969 ] Jianshi Huang commented on SPARK-6155: -- Yeah, please add the feature request. Just a question. Scala 2.11.2 is only used for Spark runtime right (shaded?)? My application built with 2.11.5 will run fine? Jianshi Build with Scala 2.11.5 failed for Spark v1.3.0-rc2 --- Key: SPARK-6155 URL: https://issues.apache.org/jira/browse/SPARK-6155 Project: Spark Issue Type: Bug Reporter: Jianshi Huang Priority: Minor Just tried to build with Scala 2.11.5. failed with following error message: [INFO] Compiling 9 Scala sources to /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes... [ERROR] /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132: value withIncompleteHandler is not a member of SparkIMain.this.global.PerRunReporting [ERROR] currentRun.reporting.withIncompleteHandler((_, _) = isIncomplete = true) { [ERROR]^ Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6154: - Component/s: SQL Build error with Scala 2.11 for v1.3.0-rc2 -- Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2
Jianshi Huang created SPARK-6154: Summary: Build error with Scala 2.11 for v1.3.0-rc2 Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5828) Dynamic partition pattern support
Jianshi Huang created SPARK-5828: Summary: Dynamic partition pattern support Key: SPARK-5828 URL: https://issues.apache.org/jira/browse/SPARK-5828 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Hi, HCatalog allows you to specify the pattern of paths for partitions, which will be used by dynamic partition loading. https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables Can we have a similar feature in SparkSQL? Thanks, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308595#comment-14308595 ] Jianshi Huang commented on SPARK-4279: -- Anyone is working on this? Implementing TinkerPop on top of GraphX --- Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5446) Parquet column pruning should work for Map and Struct
Jianshi Huang created SPARK-5446: Summary: Parquet column pruning should work for Map and Struct Key: SPARK-5446 URL: https://issues.apache.org/jira/browse/SPARK-5446 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang Consider the following query: {code:sql} select stddev_pop(variables.var1) stddev from model group by model_name {code} where variables is a Struct containing many fields (it could similarly be a Map with many key-value pairs). During execution, SparkSQL will shuffle the whole map or struct column instead of extracting the value first, so the performance is very poor. The optimized version could use a subquery: {code:sql} select stddev_pop(var) stddev from (select variables.var1 as var, model_name from model) m group by model_name {code} where we extract the field/key-value only on the mapper side, so the data being shuffled is small. Parquet already supports reading a single field/key-value at the storage level, but SparkSQL currently doesn’t have an optimization for it. This would be a very useful optimization for tables having a Map or Struct with many columns. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
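Until such pruning exists, the rewrite described above can be applied by hand. A minimal sketch using the 1.2/1.3-era API, reusing the table and column names from the example (sqlContext is assumed to be a HiveContext, since stddev_pop is a Hive UDAF):

{code}
// Narrow the table first so only the needed field and the grouping key are shuffled,
// then aggregate over the narrowed temporary table.
val narrowed = sqlContext.sql(
  "select variables.var1 as var, model_name from model")
narrowed.registerTempTable("model_narrow")

val stddev = sqlContext.sql(
  "select stddev_pop(var) stddev from model_narrow group by model_name")
{code}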
[jira] [Closed] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang closed SPARK-4781. Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext) -- Key: SPARK-4781 URL: https://issues.apache.org/jira/browse/SPARK-4781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang I have a table say created like follows: {code} CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' {code} And I renamed the column from sorted::cre_ts to cre_ts by doing: {code} ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string {code} After renaming the column, the values in the column become all NULLs. {noformat} Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) {noformat} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4781: - Issue Type: Bug (was: Improvement) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext) -- Key: SPARK-4781 URL: https://issues.apache.org/jira/browse/SPARK-4781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang I have a table say created like follows: {code} CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' {code} And I renamed the column from sorted::cre_ts to cre_ts by doing: {code} ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string {code} After renaming the column, the values in the column become all NULLs. {noformat} Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) {noformat} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4780) Support executing multiple statements in sql(...)
Jianshi Huang created SPARK-4780: Summary: Support executing multiple statements in sql(...) Key: SPARK-4780 URL: https://issues.apache.org/jira/browse/SPARK-4780 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang Priority: Minor Just wondering why executing multiple statements in one sql(...) call is not supported. A statement cannot contain ';' either. I had to manually turn my Hive script into a sequence of queries, which is not very user friendly. Please consider supporting this. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
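A minimal workaround sketch in the meantime, assuming the statements in the script are separated by ';' and that no statement contains an embedded semicolon (e.g. inside a string literal); hiveContext stands for whatever SQLContext/HiveContext the script should run against, and the statements themselves are made up:

{code}
// Split the script into individual statements and run them one by one.
val script =
  """CREATE TABLE IF NOT EXISTS t1 (id INT);
    |INSERT INTO TABLE t1 SELECT id FROM src;
    |SELECT count(*) FROM t1""".stripMargin

script.split(";")
  .map(_.trim)
  .filter(_.nonEmpty)
  .foreach(stmt => hiveContext.sql(stmt))
{code}

Results of the intermediate statements are discarded here; a real helper would probably want to return the result of the last statement.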
[jira] [Created] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
Jianshi Huang created SPARK-4781: Summary: Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext) Key: SPARK-4781 URL: https://issues.apache.org/jira/browse/SPARK-4781 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang I have a table say created like follows: CREATE EXTERNAL TABLE pmt { `sorted::cre_ts` string } And I renamed the column from sorted::cre_ts to cre_ts by doing: ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string After renaming the column, the values in the column become all NULLs. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4781: - Description: I have a table say created like follows: CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' And I renamed the column from sorted::cre_ts to cre_ts by doing: ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string After renaming the column, the values in the column become all NULLs. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) Jianshi was: I have a table say created like follows: CREATE EXTERNAL TABLE pmt { `sorted::cre_ts` string } And I renamed the column from sorted::cre_ts to cre_ts by doing: ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string After renaming the column, the values in the column become all NULLs. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) Jianshi Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext) -- Key: SPARK-4781 URL: https://issues.apache.org/jira/browse/SPARK-4781 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang I have a table say created like follows: CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' And I renamed the column from sorted::cre_ts to cre_ts by doing: ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string After renaming the column, the values in the column become all NULLs. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4781: - Description: I have a table say created like follows: {code} CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' {code} And I renamed the column from sorted::cre_ts to cre_ts by doing: {code} ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string {code} After renaming the column, the values in the column become all NULLs. {noformat} Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) {noformat} Jianshi was: I have a table say created like follows: CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' And I renamed the column from sorted::cre_ts to cre_ts by doing: ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string After renaming the column, the values in the column become all NULLs. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) Jianshi Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext) -- Key: SPARK-4781 URL: https://issues.apache.org/jira/browse/SPARK-4781 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang I have a table say created like follows: {code} CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' {code} And I renamed the column from sorted::cre_ts to cre_ts by doing: {code} ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string {code} After renaming the column, the values in the column become all NULLs. {noformat} Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) {noformat} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
Jianshi Huang created SPARK-4782: Summary: Add inferSchema support for RDD[Map[String, Any]] Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to JSON String first and use JsonRDD.inferSchema on it. It's very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for Schemaless data as adding Map like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new serialization format we want to support, we just need to add a Map interface wrapper to it* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
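Until something like this lands, here is a rough sketch of what such a helper could look like on top of the 1.3-era createDataFrame API. The name inferFromMaps and the type mapping are mine; it assumes every map has the same keys, handles only a few primitive value types, and does not merge schemas across rows or handle nested values the way a real inferSchema would:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types._

// Infer a StructType from the first map, then build Rows in that field order.
// Values are assumed to already match the inferred types.
def inferFromMaps(sqlContext: SQLContext, rdd: RDD[Map[String, Any]]): DataFrame = {
  val sample = rdd.first()
  val fields = sample.toSeq.sortBy(_._1).map { case (key, value) =>
    val dataType = value match {
      case _: Int     => IntegerType
      case _: Long    => LongType
      case _: Double  => DoubleType
      case _: Boolean => BooleanType
      case _          => StringType
    }
    StructField(key, dataType, nullable = true)
  }
  val schema = StructType(fields)
  val rows = rdd.map { m =>
    Row.fromSeq(schema.fieldNames.toSeq.map(name => m.getOrElse(name, null)))
  }
  sqlContext.createDataFrame(rows, schema)
}
{code}

Usage would be something like inferFromMaps(sqlContext, rdd).registerTempTable("events"), the point being that any serialization format only needs a Map-like wrapper, as the issue suggests.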
[jira] [Created] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
Jianshi Huang created SPARK-4760: Summary: ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get the estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} I've also tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
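For anyone reproducing this, the statements involved are just the two quoted in the report; a small sketch with a made-up table name pmt_dim:

{code:sql}
-- Collect size-only statistics without scanning the data, then inspect them.
ANALYZE TABLE pmt_dim COMPUTE STATISTICS noscan;
DESC EXTENDED pmt_dim;
{code}

The totalSize entry in the parameters line of DESC EXTENDED is what the report compares between the old and new builds.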
[jira] [Created] (SPARK-4757) Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource
Jianshi Huang created SPARK-4757: Summary: Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource Key: SPARK-4757 URL: https://issues.apache.org/jira/browse/SPARK-4757 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang I got the following error during Spark startup (Yarn-client mode): 14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar - hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar java.lang.IllegalArgumentException: Wrong FS: hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79) at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35) at org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140) at org.apache.spark.SparkContext.init(SparkContext.scala:335) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) According to Liancheng and Andrew, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4758) Make metastore_db in-memory for HiveContext
Jianshi Huang created SPARK-4758: Summary: Make metastore_db in-memory for HiveContext Key: SPARK-4758 URL: https://issues.apache.org/jira/browse/SPARK-4758 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang Priority: Minor HiveContext by default will create a local folder metastore_db. This is not very user friendly, as metastore_db will be locked by HiveContext, which blocks multiple Spark processes from starting in the same directory. I would propose adding a default hive-site.xml in conf/ with the following content.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file://${user.dir}/hive/warehouse</value>
  </property>
</configuration>

The jdbc:derby:memory:databaseName=metastore_db;create=true URL will make sure the embedded Derby database is created in-memory. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4758) Make metastore_db in-memory for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4758: - Description: HiveContext by default will create a local folder metastore_db. This is not very user friendly as the metastore_db will be locked by HiveContext and thus will block multiple Spark process to start from the same directory. I would propose adding a default hive-site.xml in conf/ with the following content. configuration property namejavax.jdo.option.ConnectionURL/name valuejdbc:derby:memory;databaseName=metastore_db;create=true/value /property property namejavax.jdo.option.ConnectionDriverName/name valueorg.apache.derby.jdbc.EmbeddedDriver/value /property property namehive.metastore.warehouse.dir/name valuefile://${user.dir}/hive/warehouse/value /property /configuration jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the embedded derby database is created in-memory. Jianshi was: HiveContext by default will create a local folder metastore_db. This is not very user friendly as the metastore_db will be locked by HiveContext and thus will block multiple Spark process to start from the same directory. I would propose adding a default hive-site.xml in conf/ with the following content. configuration property namejavax.jdo.option.ConnectionURL/name valuejdbc:derby:memory:databaseName=metastore_db;create=true/value /property property namejavax.jdo.option.ConnectionDriverName/name valueorg.apache.derby.jdbc.EmbeddedDriver/value /property property namehive.metastore.warehouse.dir/name valuefile://${user.dir}/hive/warehouse/value /property /configuration jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the embedded derby database is created in-memory. Jianshi Make metastore_db in-memory for HiveContext --- Key: SPARK-4758 URL: https://issues.apache.org/jira/browse/SPARK-4758 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang Priority: Minor HiveContext by default will create a local folder metastore_db. This is not very user friendly as the metastore_db will be locked by HiveContext and thus will block multiple Spark process to start from the same directory. I would propose adding a default hive-site.xml in conf/ with the following content. configuration property namejavax.jdo.option.ConnectionURL/name valuejdbc:derby:memory;databaseName=metastore_db;create=true/value /property property namejavax.jdo.option.ConnectionDriverName/name valueorg.apache.derby.jdbc.EmbeddedDriver/value /property property namehive.metastore.warehouse.dir/name valuefile://${user.dir}/hive/warehouse/value /property /configuration jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the embedded derby database is created in-memory. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4758) Make metastore_db in-memory for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4758: - Description: HiveContext by default will create a local folder metastore_db. This is not very user friendly as the metastore_db will be locked by HiveContext and thus will block multiple Spark process to start from the same directory. I would propose adding a default hive-site.xml in conf/ with the following content. configuration property namejavax.jdo.option.ConnectionURL/name valuejdbc:derby:memory:databaseName=metastore_db;create=true/value /property property namejavax.jdo.option.ConnectionDriverName/name valueorg.apache.derby.jdbc.EmbeddedDriver/value /property property namehive.metastore.warehouse.dir/name valuefile://${user.dir}/hive/warehouse/value /property /configuration jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the embedded derby database is created in-memory. Jianshi was: HiveContext by default will create a local folder metastore_db. This is not very user friendly as the metastore_db will be locked by HiveContext and thus will block multiple Spark process to start from the same directory. I would propose adding a default hive-site.xml in conf/ with the following content. configuration property namejavax.jdo.option.ConnectionURL/name valuejdbc:derby:memory;databaseName=metastore_db;create=true/value /property property namejavax.jdo.option.ConnectionDriverName/name valueorg.apache.derby.jdbc.EmbeddedDriver/value /property property namehive.metastore.warehouse.dir/name valuefile://${user.dir}/hive/warehouse/value /property /configuration jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the embedded derby database is created in-memory. Jianshi Make metastore_db in-memory for HiveContext --- Key: SPARK-4758 URL: https://issues.apache.org/jira/browse/SPARK-4758 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang Priority: Minor HiveContext by default will create a local folder metastore_db. This is not very user friendly as the metastore_db will be locked by HiveContext and thus will block multiple Spark process to start from the same directory. I would propose adding a default hive-site.xml in conf/ with the following content. configuration property namejavax.jdo.option.ConnectionURL/name valuejdbc:derby:memory:databaseName=metastore_db;create=true/value /property property namejavax.jdo.option.ConnectionDriverName/name valueorg.apache.derby.jdbc.EmbeddedDriver/value /property property namehive.metastore.warehouse.dir/name valuefile://${user.dir}/hive/warehouse/value /property /configuration jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the embedded derby database is created in-memory. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4549) Support BigInt - Decimal in convertToCatalyst in SparkSQL
Jianshi Huang created SPARK-4549: Summary: Support BigInt - Decimal in convertToCatalyst in SparkSQL Key: SPARK-4549 URL: https://issues.apache.org/jira/browse/SPARK-4549 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to Decimal. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
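A hedged sketch of the mapping being requested, not the actual Catalyst code; Decimal is assumed to be org.apache.spark.sql.types.Decimal (its 1.3 location) and the helper name is made up:

{code}
import org.apache.spark.sql.types.Decimal

// A scala.math.BigInt value is wrapped into an exact BigDecimal (scale 0)
// and then into Catalyst's Decimal, mirroring how BigDecimal is already handled.
def bigIntToCatalyst(value: BigInt): Decimal = Decimal(BigDecimal(value))
{code}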
[jira] [Updated] (SPARK-4549) Support BigInt - Decimal in convertToCatalyst in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4549: - Issue Type: Improvement (was: Bug) Support BigInt - Decimal in convertToCatalyst in SparkSQL -- Key: SPARK-4549 URL: https://issues.apache.org/jira/browse/SPARK-4549 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor Since BigDecimal is just a wrapper around BigInt, let's also convert to BigInt to Decimal. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4551) Allow auto-conversion of field names of case class from camelCase to lower_case convention
Jianshi Huang created SPARK-4551: Summary: Allow auto-conversion of field names of case class from camelCase to lower_case convention Key: SPARK-4551 URL: https://issues.apache.org/jira/browse/SPARK-4551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Priority: Minor In SQL, lower_case is the naming convention for column names. We want to keep this convention as much as possible. So when converting an RDD[case class] to SchemaRDD, let's make it an option to convert the field names from camelCase to lower_case (e.g. unitAmount becomes unit_amount). My proposal: 1) Utility function for case conversion (e.g. https://gist.github.com/sidharthkuruvila/3154845) 2) Add an option convertToLowerCase to registerTempTable: def registerTempTable(tableName: String, convertToLowerCase: Boolean = false) Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
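A minimal sketch of the utility function from point 1, along the lines of the linked gist (the name toSnakeCase is mine):

{code}
// Convert camelCase identifiers to lower_case_with_underscores,
// e.g. "unitAmount" -> "unit_amount", "parseURLPath" -> "parse_url_path".
def toSnakeCase(name: String): String =
  name
    .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
    .replaceAll("([a-z\\d])([A-Z])", "$1_$2")
    .toLowerCase
{code}

Point 2 would then boil down to applying this to every field name when convertToLowerCase is true.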
[jira] [Closed] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang closed SPARK-4199. Resolution: Invalid Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195045#comment-14195045 ] Jianshi Huang commented on SPARK-4199: -- Turned out it was caused by wrong version of datanucleus jars in my spark build directory. Somehow I have two versions of datanucleus... After removing the wrong version, now all works. Thanks! Jianshi Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4199: - Summary: Drop table if exists raises table not found exception in HiveContext (was: Drop table if exists raises table not found exception ) Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4199) Drop table if exists raises table not found exception
Jianshi Huang created SPARK-4199: Summary: Drop table if exists raises table not found exception Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3923) All Standalone Mode services time out with each other
[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169901#comment-14169901 ] Jianshi Huang commented on SPARK-3923: -- I have similar problem in YARN-client mode. Setting spark.akka.heartbeat.interval to 100 fixes the problem. This is a critical bug. Jianshi All Standalone Mode services time out with each other - Key: SPARK-3923 URL: https://issues.apache.org/jira/browse/SPARK-3923 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Blocker I'm seeing an issue where it seems that components in Standalone Mode (Worker, Master, Driver, and Executor) all seem to time out with each other after around 1000 seconds. Here is an example log: {code} 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356- ... precisely 1000 seconds later ... 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it. 14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it. 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered. 14/10/13 07:00:36 INFO Master: akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it. 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356- 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with
[jira] [Comment Edited] (SPARK-3923) All Standalone Mode services time out with each other
[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169901#comment-14169901 ] Jianshi Huang edited comment on SPARK-3923 at 10/13/14 8:45 PM: I have similar problem in YARN-client mode. Setting spark.akka.heartbeat.interval to 100 fixes the problem. This is a critical bug. P.S. One thing made me very confused during debuggin is the error message. The important one WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@xxx:50278] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. is of Log Level WARN. Jianshi was (Author: huangjs): I have similar problem in YARN-client mode. Setting spark.akka.heartbeat.interval to 100 fixes the problem. This is a critical bug. Jianshi All Standalone Mode services time out with each other - Key: SPARK-3923 URL: https://issues.apache.org/jira/browse/SPARK-3923 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Blocker I'm seeing an issue where it seems that components in Standalone Mode (Worker, Master, Driver, and Executor) all seem to time out with each other after around 1000 seconds. Here is an example log: {code} 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356- ... precisely 1000 seconds later ... 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it. 14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it. 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered. 14/10/13 07:00:36 INFO Master: akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it. 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from
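For reference, the workaround mentioned in the comments above amounts to a single line in spark-defaults.conf (100 is the value reported to work here; the appropriate value likely depends on the cluster):

{noformat}
# conf/spark-defaults.conf
spark.akka.heartbeat.interval  100
{noformat}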
[jira] [Created] (SPARK-3906) Support joins of multiple tables in SparkSQL (SQLContext, not HiveQL)
Jianshi Huang created SPARK-3906: Summary: Support joins of multiple tables in SparkSQL (SQLContext, not HiveQL) Key: SPARK-3906 URL: https://issues.apache.org/jira/browse/SPARK-3906 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Jianshi Huang Queries like: select * from fp inner join dts on fp.status_code = dts.status_code inner join dc on fp.curr_code = dc.curr_code are currently not supported in SparkSQL. The workaround is to use subqueries, but it's quite cumbersome. Please support it. :) Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
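The subquery workaround mentioned above looks roughly like this, reusing the table and column names from the report (columns from dts would have to be selected explicitly inside the subquery if they are needed):

{code:sql}
-- Join two tables inside a subquery, then join the result with the third table.
select *
from (select fp.*
      from fp inner join dts on fp.status_code = dts.status_code) t
inner join dc on t.curr_code = dc.curr_code
{code}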
[jira] [Resolved] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang resolved SPARK-3845. -- Resolution: Fixed Fix Version/s: 1.2.0 SQLContext(...) should inherit configurations from SparkContext --- Key: SPARK-3845 URL: https://issues.apache.org/jira/browse/SPARK-3845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Fix For: 1.2.0 It's very confusing that Spark configurations (e.g. spark.serializer, spark.speculation, etc.) can be set in the spark-default.conf file, while SparkSQL configurations (e..g spark.sql.inMemoryColumnarStorage.compressed, spark.sql.codegen, etc.) has to be set either in sqlContext.setConf or sql(SET ...). When I do: val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) I would expect sqlContext recognizes all the SQL configurations comes with sparkContext. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164882#comment-14164882 ] Jianshi Huang commented on SPARK-3845: -- Looks like it's fixed in latest 1.2.0 snapshot. In 1.1.0, sqlContext.getAllConfs returns empty map. Jianshi SQLContext(...) should inherit configurations from SparkContext --- Key: SPARK-3845 URL: https://issues.apache.org/jira/browse/SPARK-3845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang It's very confusing that Spark configurations (e.g. spark.serializer, spark.speculation, etc.) can be set in the spark-default.conf file, while SparkSQL configurations (e..g spark.sql.inMemoryColumnarStorage.compressed, spark.sql.codegen, etc.) has to be set either in sqlContext.setConf or sql(SET ...). When I do: val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) I would expect sqlContext recognizes all the SQL configurations comes with sparkContext. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext
Jianshi Huang created SPARK-3845: Summary: SQLContext(...) should inherit configurations from SparkContext Key: SPARK-3845 URL: https://issues.apache.org/jira/browse/SPARK-3845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang It's very confusing that Spark configurations (e.g. spark.serializer, spark.speculation, etc.) can be set in the spark-defaults.conf file, while SparkSQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, spark.sql.codegen, etc.) have to be set either via sqlContext.setConf or sql(SET ...). When I do: val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) I would expect sqlContext to recognize all the SQL configurations that come with sparkContext. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
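A small sketch of the expectation being described; spark.sql.codegen is just one example key, and per the comment above the final assertion holds on 1.2.0 but not on 1.1.0:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// SQL options set on the SparkConf (e.g. via spark-defaults.conf) ...
val conf = new SparkConf()
  .setAppName("conf-inheritance")
  .setMaster("local[*]")
  .set("spark.sql.codegen", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// ... should be visible to the SQLContext without a separate setConf call.
assert(sqlContext.getConf("spark.sql.codegen") == "true")
{code}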
[jira] [Created] (SPARK-3846) KryoException when doing joins in SparkSQL
Jianshi Huang created SPARK-3846: Summary: KryoException when doing joins in SparkSQL Key: SPARK-3846 URL: https://issues.apache.org/jira/browse/SPARK-3846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang I built the latest Spark (1.2.0-SNAPSHOT) from master branch and found previous (1.1.0) successful jobs failed. The error is reproducible when I join two tables manually. The error message is like follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, lvshdc5dn0215.lvs.paypal.com): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3846) KryoException when doing joins in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-3846: - Description: The error is reproducible when I join two tables manually. The error message is like follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, lvshdc5dn0215.lvs.paypal.com): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) was: I built the latest Spark (1.2.0-SNAPSHOT) from master branch and found previous (1.1.0) successful jobs failed. The error is reproducible when I join two tables manually. The error message is like follows. 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, lvshdc5dn0215.lvs.paypal.com): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
[jira] [Closed] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang closed SPARK-2890. Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Assignee: Michael Armbrust Fix For: 1.2.0 Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
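A minimal sketch of the failure mode, assuming a Parquet-backed table registered as events with columns id and type (all hypothetical): selecting the same column twice gives the result schema two fields named id, which the StructType constructor rejects with the requirement failure above.
{noformat}
// Hypothetical table and column names; the point is the repeated bare `id`.
val dup = sqlContext.sql("SELECT id, type, id FROM events")
dup.collect()
// => java.lang.IllegalArgumentException: requirement failed:
//    Found fields with the same name.
{noformat}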
[jira] [Updated] (SPARK-3846) KryoException when doing joins in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-3846: - Description: The error is reproducible when I join two tables manually. The error message is like follows. org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198) org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:724) was: The error is reproducible when I join two tables manually. The error message is like follows. 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 (TID 3802, lvshdc5dn0215.lvs.paypal.com): com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138) com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115) com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095224#comment-14095224 ] Jianshi Huang commented on SPARK-2890: -- I think the fault is on my side. I should've projected the duplicated columns to different names instead. So the current behavior makes sense. I'll close this issue. Jianshi Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
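A hedged sketch of that workaround, again with hypothetical table and column names: give each repeated column a distinct alias so every field in the result schema has a unique name.
{noformat}
val fixed = sqlContext.sql(
  """SELECT id, type, properties,
    |       id   AS prop_id,
    |       type AS prop_type
    |FROM events""".stripMargin)
fixed.collect() // schema now has unique field names, so the StructType check passes
{noformat}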
[jira] [Closed] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang closed SPARK-2890. Resolution: Invalid Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093746#comment-14093746 ] Jianshi Huang commented on SPARK-2890: -- My use case: The result will be parsed into (id, type, start, end, properties) tuples. Properties might or might not contain any of (id, type, start, end). So it's easier to just list them all at the end and not worry about duplicated names. Jianshi Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088939#comment-14088939 ] Jianshi Huang commented on SPARK-2890: -- In previous versions, there were no warnings, but the result was buggy and contained nulls in the duplicated columns. Jianshi Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
Jianshi Huang created SPARK-2890: Summary: Spark SQL should allow SELECT with duplicated columns Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080729#comment-14080729 ] Jianshi Huang commented on SPARK-2728: -- I see. Thanks for the fix, Sean and Larry! Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBoundsException. The reason is the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080741#comment-14080741 ] Jianshi Huang commented on SPARK-2728: -- Can anyone test it? Then I'll close the issue. My build for HDP 2.1 won't work with YARN... strange. Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBoundsException. The reason is the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
Jianshi Huang created SPARK-2728: Summary: Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBoundsException. The reason is the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
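To make the overflow concrete, here is a small self-contained sketch (with hypothetical sample and partition counts, not the actual RangePartitioner source) showing how the Int product wraps to a negative value and how widening to Long first avoids it:
{noformat}
object RangeIndexOverflow {
  def main(args: Array[String]): Unit = {
    val sampleLength = 400000          // hypothetical rddSample.length (~20 samples per partition)
    val partitions   = 20000           // > 10362, as in the report
    val i            = partitions - 2  // last iteration of the bounds loop

    // Int arithmetic: (sampleLength - 1) * (i + 1) exceeds Int.MaxValue and wraps
    // to a negative value, so the index is negative -> ArrayIndexOutOfBoundsException.
    val badIndex = (sampleLength - 1) * (i + 1) / partitions

    // Widening to Long before multiplying keeps the intermediate product exact.
    val goodIndex = ((sampleLength - 1).toLong * (i + 1) / partitions).toInt

    println(s"Int arithmetic:  $badIndex")  // negative
    println(s"Long arithmetic: $goodIndex") // 399979, a valid index into the sample
  }
}
{noformat}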