[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-06-01 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567279#comment-14567279
 ] 

Jianshi Huang commented on SPARK-4782:
--

Thanks, Luca, for the clever fix!

I also noticed that the schema inference in JsonRDD is too JSON-specific, as 
JSON's data types are quite limited.

Jianshi

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Priority: Minor

 The best way to convert an RDD[Map[String, Any]] to a SchemaRDD currently seems to 
 be converting each Map to a JSON String first and using JsonRDD.inferSchema on it.
 This is very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for 
 schemaless data, as adding a Map-like interface to any serialization format is 
 easy.
 So please add inferSchema support for RDD[Map[String, Any]]. *Then, for any new 
 serialization format we want to support, we just need to add a Map interface 
 wrapper to it.*
 Jianshi
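
A minimal sketch of what such an inferSchema could look like for flat maps, written against the public 1.3+ API; the helper name and the sampling of only the first record are simplifications for illustration, not part of any proposal:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types._

// Hypothetical helper: infer a flat schema from the first record and convert
// the maps directly, with no JSON round trip. Nested values and type conflicts
// across records are ignored to keep the sketch short.
def inferSchemaFromMaps(sqlContext: SQLContext, rdd: RDD[Map[String, Any]]): DataFrame = {
  def dataTypeOf(v: Any): DataType = v match {
    case _: Int     => IntegerType
    case _: Long    => LongType
    case _: Double  => DoubleType
    case _: Boolean => BooleanType
    case _          => StringType // fall back to String for anything else
  }
  val sample = rdd.first()
  val schema = StructType(sample.toSeq.map { case (k, v) =>
    StructField(k, dataTypeOf(v), nullable = true)
  })
  val rows = rdd.map(m => Row.fromSeq(schema.fieldNames.map(name => m.getOrElse(name, null)).toSeq))
  sqlContext.createDataFrame(rows, schema)
}
{code}

With a Map-backed wrapper over any serialization format, the same kind of helper would apply unchanged.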






[jira] [Created] (SPARK-8012) ArrayIndexOutOfBoundsException in SerializationDebugger

2015-06-01 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-8012:


 Summary: ArrayIndexOutOfBoundsException in SerializationDebugger
 Key: SPARK-8012
 URL: https://issues.apache.org/jira/browse/SPARK-8012
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Jianshi Huang


It makes the underlying NotSerializableException less obvious.

{noformat}
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
at 
org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682)
at 
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:40)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
at 
org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:159)
at 
org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:131)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.sources.DataSourceStrategy$.buildPartitionedTableScan(DataSourceStrategy.scala:131)
at 
org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:80)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:109)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 

[jira] [Created] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder

2015-06-01 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-8014:


 Summary: DataFrame.write.mode(error).save(...) should not scan 
the output folder
 Key: SPARK-8014
 URL: https://issues.apache.org/jira/browse/SPARK-8014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang
Priority: Minor


I had code that set the wrong output location and failed with strange errors: 
it scanned my ~/.Trash folder...

It turned out that save() scans the output folder before mode("error") does 
the existence check.

Scanning is unnecessary when the mode is "error".
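
For reference, a minimal sketch of the call pattern described above (df and the path are placeholders):

{code}
import org.apache.spark.sql.SaveMode

// Expected: fail fast if the target path already exists.
// Observed (per this report): the existing folder is scanned before the check.
df.write.mode(SaveMode.ErrorIfExists).format("parquet").save("/wrong/output/path")
{code}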








[jira] [Commented] (SPARK-8012) ArrayIndexOutOfBoundsException in SerializationDebugger

2015-06-01 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568350#comment-14568350
 ] 

Jianshi Huang commented on SPARK-8012:
--

Yeah, it's from a pretty big code base. I'm trying to reduce the scope.

BTW, I'm using 1.4.0-rc3.

Jianshi
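
A possible workaround sketch in the meantime, assuming the spark.serializer.extraDebugInfo flag behaves as in 1.3/1.4: disable the SerializationDebugger so the original NotSerializableException surfaces directly.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Disable SerializationDebugger so serialization failures propagate unchanged.
// (Flag name assumed from the 1.3/1.4 configuration; verify against your build.)
val conf = new SparkConf()
  .setAppName("serialization-debugger-workaround")
  .set("spark.serializer.extraDebugInfo", "false")
val sc = new SparkContext(conf)
{code}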

 ArrayIndexOutOfBoundsException in SerializationDebugger
 ---

 Key: SPARK-8012
 URL: https://issues.apache.org/jira/browse/SPARK-8012
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Jianshi Huang

 It makes the underlying NotSerializableException less obvious.
 {noformat}
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248)
 at 
 org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
 at 
 org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
 at 
 org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166)
 at 
 org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
 at 
 org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66)
 at 
 org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
 at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
 at 
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
 at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682)
 at 
 org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:40)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
 at 
 org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
 at 
 org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:159)
 at 
 org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:131)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
 at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
 at 
 org.apache.spark.sql.sources.DataSourceStrategy$.buildPartitionedTableScan(DataSourceStrategy.scala:131)
 at 
 org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:80)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
 at 
 

[jira] [Commented] (SPARK-6297) EventLog permissions are always set to 770 which causes problems

2015-05-31 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566907#comment-14566907
 ] 

Jianshi Huang commented on SPARK-6297:
--

What about YARN? I have a problem sharing event logs among our team: Spark 
processes in YARN run under each user's account.

Jianshi

 EventLog permissions are always set to 770 which causes problems
 

 Key: SPARK-6297
 URL: https://issues.apache.org/jira/browse/SPARK-6297
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: All, tested in lcx running different users without 
 common group
Reporter: Lukas Stefaniak
Priority: Trivial
  Labels: newbie

 In EventLoggingListener, event log file permissions are always set explicitly 
 to 770. There is no way to override this.
 The problem appears as an exception being thrown when the driver process and the 
 Spark master don't share the same user or group.
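
For context, a sketch of the standard event-log settings (directory path is a placeholder); the log location is configurable, while the 770 permission applied to the written files is not:

{code}
import org.apache.spark.SparkConf

// Standard event-log settings; the 770 permission set by EventLoggingListener
// on the log files cannot be changed through configuration.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///shared/spark-events")
{code}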






[jira] [Commented] (SPARK-7939) Make URL partition recognition return String by default for all partition column types and values

2015-05-29 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564412#comment-14564412
 ] 

Jianshi Huang commented on SPARK-7939:
--

That would be nice. Also, please consider disabling type inference as the default behavior.

Jianshi

 Make URL partition recognition return String by default for all partition 
 column types and values
 -

 Key: SPARK-7939
 URL: https://issues.apache.org/jira/browse/SPARK-7939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang
  Labels: 1.4.1

 Imagine the following HDFS paths:
 /data/split=00
 /data/split=01
 ...
 /data/split=FF
 If I have 10 or fewer partition values (00, 01, ..., 09), partition recognition 
 currently treats column 'split' as an integer column. 
 If I have more than 10 partition values, column 'split' is recognized as 
 String...
 This is very confusing. *So I'm suggesting treating partition columns as 
 String by default*, and allowing the user to specify types if needed.
 Another example is date:
 /data/date=2015-04-01 => 'date' is String
 /data/date=20150401 => 'date' is Int
 Jianshi
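
Until the default changes, a workaround sketch is to cast the partition column back to String after loading; 'split' is taken from the example above and 'value' is a hypothetical data column:

{code}
val df = sqlContext.read.format("parquet").load("/data")
// 'split' may come back as IntegerType depending on the partition values seen.
val normalized = df.select(df("split").cast("string").as("split"), df("value"))
{code}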






[jira] [Updated] (SPARK-7939) Make URL partition recognition return String by default for all partition column types and values

2015-05-29 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-7939:
-
Summary: Make URL partition recognition return String by default for all 
partition column types and values  (was: Make URL partition recognition return 
String by default for all partition column values)

 Make URL partition recognition return String by default for all partition 
 column types and values
 -

 Key: SPARK-7939
 URL: https://issues.apache.org/jira/browse/SPARK-7939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang

 Imagine the following HDFS paths:
 /data/split=00
 /data/split=01
 ...
 /data/split=FF
 If I have 10 or fewer partition values (00, 01, ..., 09), partition recognition 
 currently treats column 'split' as an integer column. 
 If I have more than 10 partition values, column 'split' is recognized as 
 String...
 This is very confusing. *So I'm suggesting treating partition columns as 
 String by default*, and allowing the user to specify types if needed.
 Another example is date:
 /data/date=2015-04-01 => 'date' is String
 /data/date=20150401 => 'date' is Int
 Jianshi






[jira] [Created] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-29 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-7937:


 Summary: Cannot compare Hive named_struct. (when using argmax, 
argmin)
 Key: SPARK-7937
 URL: https://issues.apache.org/jira/browse/SPARK-7937
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang


Imagine the following SQL:

Intention: get last used bank account country.
 
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
 
=>

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
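
A workaround sketch while struct ordering is unsupported, assuming sqlContext is a HiveContext (for unix_timestamp and the table) and ignoring ties on the timestamp: take the max timestamp per account and join back to pick up the country.

{code}
import org.apache.spark.sql.functions._

// Avoid max() over a named_struct by aggregating on the timestamp alone and
// joining back. Pass the same format string to unix_timestamp as in the
// original query.
val monthly = sqlContext.table("bank_account_monthly")
  .filter("year_month = '201502'")
  .selectExpr("bank_account_id", "bank_country",
              "unix_timestamp(src_row_update_ts) as ts")
val latest = monthly.groupBy("bank_account_id").agg(max("ts").as("ts"))
val result = latest.as("l")
  .join(monthly.as("m"),
        col("l.bank_account_id") === col("m.bank_account_id") && col("l.ts") === col("m.ts"))
  .select(col("m.bank_account_id"), col("m.bank_country"))
{code}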







[jira] [Updated] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-29 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-7937:
-
Description: 
Imagine the following SQL:

Intention: get last used bank account country.
 
{code:sql}
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
{code}

=>
{noformat}
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
{noformat}

  was:
Imagine the following SQL:

Intention: get last used bank account country.
 
``` sql
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
```

= 
```
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at 

[jira] [Updated] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-29 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-7937:
-
Description: 
Imagine the following SQL:

Intention: get last used bank account country.
 
``` sql
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
```

= 
```
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
```

  was:
Imagine the following SQL:

Intention: get last used bank account country.
 
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
 
= 

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at 

[jira] [Created] (SPARK-7939) Make URL partition recognition return String by default for all partition column values

2015-05-29 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-7939:


 Summary: Make URL partition recognition return String by default 
for all partition column values
 Key: SPARK-7939
 URL: https://issues.apache.org/jira/browse/SPARK-7939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang


Imagine the following HDFS paths:

/data/split=00
/data/split=01
...
/data/split=FF

If I have 10 or fewer partition values (00, 01, ..., 09), partition recognition 
currently treats column 'split' as an integer column. 

If I have more than 10 partition values, column 'split' is recognized as 
String...

This is very confusing. *So I'm suggesting treating partition columns as String 
by default*, and allowing the user to specify types if needed.

Another example is date:
/data/date=2015-04-01 => 'date' is String
/data/date=20150401 => 'date' is Int

Jianshi






[jira] [Commented] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-29 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564267#comment-14564267
 ] 

Jianshi Huang commented on SPARK-7937:
--

Blog post describing Hive's argmax/argmin feature: 
https://www.joefkelley.com/?p=727

Hive JIRA: https://issues.apache.org/jira/browse/HIVE-1128

Jianshi

 Cannot compare Hive named_struct. (when using argmax, argmin)
 -

 Key: SPARK-7937
 URL: https://issues.apache.org/jira/browse/SPARK-7937
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang

 Imagine the following SQL:
 Intention: get last used bank account country.
  
 {code:sql}
 select bank_account_id, 
   max(named_struct(
 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D 
 HH:mm:ss'), 
 'bank_country', bank_country)).bank_country 
 from bank_account_monthly
 where year_month='201502' 
 group by bank_account_id
 {code}
 =>
 {noformat}
 Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in 
 stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type 
 StructType(StructField(src_row_update_ts,LongType,true), 
 StructField(bank_country,StringType,true)) does not support ordered operations
 at scala.sys.package$.error(package.scala:27)
 at 
 org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
 at 
 org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
 at 
 org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
 at 
 org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
 at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
 at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 {noformat}






[jira] [Commented] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

2015-05-19 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550028#comment-14550028
 ] 

Jianshi Huang commented on SPARK-6533:
--

Ah, right. Tested in 1.4.0, sqlc.load works!

Thanks Baishuo.

Jianshi

 Allow using wildcard and other file pattern in Parquet DataSource
 -

 Key: SPARK-6533
 URL: https://issues.apache.org/jira/browse/SPARK-6533
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang
Priority: Critical

 By default, spark.sql.parquet.useDataSourceApi is set to true, and loading 
 Parquet files using a file pattern then throws errors.
 *\*Wildcard*
 {noformat}
 scala> val qp = 
 sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded.
 java.io.FileNotFoundException: File does not exist: 
 hdfs://.../source=live/date=2014-06-0*
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
 {noformat}
 And
 *\[abc\]*
 {noformat}
 val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
 java.lang.IllegalArgumentException: Illegal character in path at index 74: 
 hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI.create(URI.java:859)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
   ... 49 elided
 Caused by: java.net.URISyntaxException: Illegal character in path at index 
 74: hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI$Parser.fail(URI.java:2829)
   at java.net.URI$Parser.checkChars(URI.java:3002)
   at java.net.URI$Parser.parseHierarchical(URI.java:3086)
   at java.net.URI$Parser.parse(URI.java:3034)
   at java.net.URI.init(URI.java:595)
   at java.net.URI.create(URI.java:857)
 {noformat}
 If spark.sql.parquet.useDataSourceApi is not enabled, we cannot have partition 
 discovery, schema evolution, etc., but being able to specify a file pattern is 
 also very important to applications.
 Please add this important feature.
 Jianshi






[jira] [Resolved] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

2015-05-19 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang resolved SPARK-6533.
--
Resolution: Won't Fix

Don't use sqlc.parquetFile(...); use sqlc.load(..., "parquet") instead, or the 
latest Reader/Writer API.

Jianshi
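
A sketch of the two suggested alternatives (paths abbreviated as in the report):

{code}
// 1.3-style generic load with an explicit source:
val df1 = sqlContext.load("hdfs://.../source=live/date=2014-06-0*", "parquet")

// 1.4 reader API:
val df2 = sqlContext.read.format("parquet").load("hdfs://.../source=live/date=2014-06-0*")
{code}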

 Allow using wildcard and other file pattern in Parquet DataSource
 -

 Key: SPARK-6533
 URL: https://issues.apache.org/jira/browse/SPARK-6533
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang
Priority: Critical

 By default, spark.sql.parquet.useDataSourceApi is set to true, and loading 
 Parquet files using a file pattern then throws errors.
 *\*Wildcard*
 {noformat}
 scala> val qp = 
 sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded.
 java.io.FileNotFoundException: File does not exist: 
 hdfs://.../source=live/date=2014-06-0*
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
 {noformat}
 And
 *\[abc\]*
 {noformat}
 val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
 java.lang.IllegalArgumentException: Illegal character in path at index 74: 
 hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI.create(URI.java:859)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
   ... 49 elided
 Caused by: java.net.URISyntaxException: Illegal character in path at index 
 74: hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI$Parser.fail(URI.java:2829)
   at java.net.URI$Parser.checkChars(URI.java:3002)
   at java.net.URI$Parser.parseHierarchical(URI.java:3086)
   at java.net.URI$Parser.parse(URI.java:3034)
   at java.net.URI.init(URI.java:595)
   at java.net.URI.create(URI.java:857)
 {noformat}
 If spark.sql.parquet.useDataSourceApi is not enabled, we cannot have partition 
 discovery, schema evolution, etc., but being able to specify a file pattern is 
 also very important to applications.
 Please add this important feature.
 Jianshi






[jira] [Commented] (SPARK-7614) CLONE - Master fails on 2.11 with compilation error

2015-05-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542596#comment-14542596
 ] 

Jianshi Huang commented on SPARK-7614:
--

Yeah, it seems I cannot reopen 7399. That's why I cloned it.

Jianshi

 CLONE - Master fails on 2.11 with compilation error
 ---

 Key: SPARK-7614
 URL: https://issues.apache.org/jira/browse/SPARK-7614
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Jianshi Huang
Assignee: Tijo Thomas

 The current code in master (and 1.4 branch) fails on 2.11 with the following 
 compilation error:
 {code}
 [error] /home/ubuntu/workspace/Apache Spark (master) on 
 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in 
 object RDDOperationScope, multiple overloaded alternatives of method 
 withScope define default arguments.
 [error] private[spark] object RDDOperationScope {
 [error]   ^
 {code}






[jira] [Commented] (SPARK-4356) Test Scala 2.11 on Jenkins

2015-05-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542603#comment-14542603
 ] 

Jianshi Huang commented on SPARK-4356:
--

When can we have 2.11 build tests in Jenkins?

Jianshi

 Test Scala 2.11 on Jenkins
 --

 Key: SPARK-4356
 URL: https://issues.apache.org/jira/browse/SPARK-4356
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

 We need to make some modifications to the test harness so that we can test 
 Scala 2.11 in Maven regularly.






[jira] [Commented] (SPARK-3056) Sort-based Aggregation

2015-05-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542148#comment-14542148
 ] 

Jianshi Huang commented on SPARK-3056:
--

Will [SPARK-2926] alone be enough for this improvement?

 Sort-based Aggregation
 --

 Key: SPARK-3056
 URL: https://issues.apache.org/jira/browse/SPARK-3056
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, Spark SQL only supports hash-based aggregation, which may cause 
 OOM if there are too many identical keys in the input tuples.






[jira] [Created] (SPARK-7614) CLONE - Master fails on 2.11 with compilation error

2015-05-13 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-7614:


 Summary: CLONE - Master fails on 2.11 with compilation error
 Key: SPARK-7614
 URL: https://issues.apache.org/jira/browse/SPARK-7614
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Jianshi Huang
Assignee: Tijo Thomas
 Fix For: 1.4.0


The current code in master (and 1.4 branch) fails on 2.11 with the following 
compilation error:

{code}
[error] /home/ubuntu/workspace/Apache Spark (master) on 
2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in 
object RDDOperationScope, multiple overloaded alternatives of method withScope 
define default arguments.
[error] private[spark] object RDDOperationScope {
[error]   ^
{code}






[jira] [Commented] (SPARK-7399) Master fails on 2.11 with compilation error

2015-05-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542589#comment-14542589
 ] 

Jianshi Huang commented on SPARK-7399:
--

Looks like another change broke it again.

Jianshi

 Master fails on 2.11 with compilation error
 ---

 Key: SPARK-7399
 URL: https://issues.apache.org/jira/browse/SPARK-7399
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Iulian Dragos
Assignee: Tijo Thomas
 Fix For: 1.4.0


 The current code in master (and 1.4 branch) fails on 2.11 with the following 
 compilation error:
 {code}
 [error] /home/ubuntu/workspace/Apache Spark (master) on 
 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in 
 object RDDOperationScope, multiple overloaded alternatives of method 
 withScope define default arguments.
 [error] private[spark] object RDDOperationScope {
 [error]   ^
 {code}






[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11

2015-05-12 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539617#comment-14539617
 ] 

Jianshi Huang commented on SPARK-6154:
--

Thanks, Aniket. If you can submit the pull request, I'm happy to try it.

Jianshi

 Support Kafka, JDBC in Scala 2.11
 -

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Building v1.3.0-rc2 with Scala 2.11 using the instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2
 5: object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1
 65: not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi






[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11

2015-05-08 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534344#comment-14534344
 ] 

Jianshi Huang commented on SPARK-6154:
--

Do you mean we need to upgrade the jline version for both 2.11 and 2.10?

Jianshi

 Support Kafka, JDBC in Scala 2.11
 -

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Building v1.3.0-rc2 with Scala 2.11 using the instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2
 5: object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1
 65: not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi






[jira] [Created] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6561:


 Summary: Add partition support in saveAsParquet
 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang


Now ParquetRelation2 supports automatic partition discovery which is very nice. 

When we save a DataFrame into Parquet files, we also want to have it 
partitioned.

The proposed API looks like this:

{code}
def saveAsParquet(path: String, partitionColumns: Seq[String])
{code}

Jianshi
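
For comparison, a sketch of how the later (1.4) writer API covers this, assuming DataFrameWriter.partitionBy is available (df and the path are placeholders):

{code}
// Partitioned Parquet output via the 1.4 DataFrameWriter:
df.write.partitionBy("date").format("parquet").save("/path/to/table")
{code}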






[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6561:
-
Description: 
Now ParquetRelation2 supports automatic partition discovery which is very nice. 

When we save a DataFrame into Parquet files, we also want to have it 
partitioned.

The proposed API looks like this:

{code}
def saveAsParquetFile(path: String, partitionColumns: Seq[String])
{code}

Jianshi

  was:
Now ParquetRelation2 supports automatic partition discovery which is very nice. 

When we save a DataFrame into Parquet files, we also want to have it 
partitioned.

The proposed API looks like this:

{code}
def saveAsParquet(path: String, partitionColumns: Seq[String])
{code}

Jianshi


 Add partition support in saveAsParquet
 --

 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Now ParquetRelation2 supports automatic partition discovery which is very 
 nice. 
 When we save a DataFrame into Parquet files, we also want to have it 
 partitioned.
 The proposed API looks like this:
 {code}
 def saveAsParquetFile(path: String, partitionColumns: Seq[String])
 {code}
 Jianshi






[jira] [Created] (SPARK-6533) Cannot use wildcard and other file pattern in sqlContext.parquetFile if spark.sql.parquet.useDataSourceApi is not set to false

2015-03-25 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6533:


 Summary: Cannot use wildcard and other file pattern in 
sqlContext.parquetFile if spark.sql.parquet.useDataSourceApi is not set to false
 Key: SPARK-6533
 URL: https://issues.apache.org/jira/browse/SPARK-6533
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang


If spark.sql.parquet.useDataSourceApi is not set to false (true is the default), 
loading Parquet files using a file pattern will throw errors.

*\*Wildcard*
{noformat}
scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: 
hdfs://.../source=live/date=2014-06-0*
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
{noformat}

And

*\[abc\]*
{noformat}
val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
java.lang.IllegalArgumentException: Illegal character in path at index 74: 
hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI.create(URI.java:859)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
  ... 49 elided
Caused by: java.net.URISyntaxException: Illegal character in path at index 74: 
hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI$Parser.fail(URI.java:2829)
  at java.net.URI$Parser.checkChars(URI.java:3002)
  at java.net.URI$Parser.parseHierarchical(URI.java:3086)
  at java.net.URI$Parser.parse(URI.java:3034)
  at java.net.URI.init(URI.java:595)
  at java.net.URI.create(URI.java:857)
{noformat}
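
A workaround sketch that seems to avoid the problem (my assumption, not a confirmed 
fix): expand the file pattern with Hadoop's FileSystem API and pass the concrete 
directories to parquetFile, which takes multiple paths in 1.3.

{code}
// Workaround sketch: expand the glob ourselves with Hadoop's FileSystem API,
// then hand parquetFile the concrete directories. Assumes a running sc and
// sqlContext; the hdfs://... URL is just a placeholder for the real path.
import org.apache.hadoop.fs.{FileSystem, Path}

val pattern = new Path("hdfs://.../source=live/date=2014-06-0*")
val fs = FileSystem.get(pattern.toUri, sc.hadoopConfiguration)
val dirs = fs.globStatus(pattern).map(_.getPath.toString)

val qp = sqlContext.parquetFile(dirs: _*)
{code}

Setting spark.sql.parquet.useDataSourceApi=false also avoids the error, at the cost 
of the new code path's features.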


Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

2015-03-25 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6533:
-
Description: 
By default, spark.sql.parquet.useDataSourceApi is set to true, and loading 
parquet files using a file pattern then throws errors.

*\*Wildcard*
{noformat}
scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: 
hdfs://.../source=live/date=2014-06-0*
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
{noformat}

And

*\[abc\]*
{noformat}
val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
java.lang.IllegalArgumentException: Illegal character in path at index 74: 
hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI.create(URI.java:859)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
  ... 49 elided
Caused by: java.net.URISyntaxException: Illegal character in path at index 74: 
hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI$Parser.fail(URI.java:2829)
  at java.net.URI$Parser.checkChars(URI.java:3002)
  at java.net.URI$Parser.parseHierarchical(URI.java:3086)
  at java.net.URI$Parser.parse(URI.java:3034)
  at java.net.URI.init(URI.java:595)
  at java.net.URI.create(URI.java:857)
{noformat}

If spark.sql.parquet.useDataSourceApi is not enabled we lose partition discovery, 
schema evolution, etc., but being able to specify a file pattern is also very 
important to applications.

Please add this important feature.

Jianshi

  was:
If spark.sql.parquet.useDataSourceApi is not set to false, which is the default.

Loading parquet files using file pattern will throw errors.

*\*Wildcard*
{noformat}
scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*)
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: 
hdfs://.../source=live/date=2014-06-0*
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 

[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

2015-03-25 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6533:
-
Description: 
If spark.sql.parquet.useDataSourceApi is not set to false (true is the default), 
loading parquet files using a file pattern throws errors.

*\*Wildcard*
{noformat}
scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: 
hdfs://.../source=live/date=2014-06-0*
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
{noformat}

And

*\[abc\]*
{noformat}
val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
java.lang.IllegalArgumentException: Illegal character in path at index 74: 
hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI.create(URI.java:859)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
  ... 49 elided
Caused by: java.net.URISyntaxException: Illegal character in path at index 74: 
hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI$Parser.fail(URI.java:2829)
  at java.net.URI$Parser.checkChars(URI.java:3002)
  at java.net.URI$Parser.parseHierarchical(URI.java:3086)
  at java.net.URI$Parser.parse(URI.java:3034)
  at java.net.URI.init(URI.java:595)
  at java.net.URI.create(URI.java:857)
{noformat}

If spark.sql.parquet.useDataSourceApi is not enabled we lose partition discovery, 
schema evolution, etc., but being able to specify a file pattern is also very 
important to applications.

Please add this important feature.

Jianshi

  was:
If spark.sql.parquet.useDataSourceApi is not set to false, which is the default.

Loading parquet files using file pattern will throw errors.

*\*Wildcard*
{noformat}
scala val qp = sqlContext.parquetFile(hdfs://.../source=live/date=2014-06-0*)
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: 
hdfs://.../source=live/date=2014-06-0*
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 

[jira] [Created] (SPARK-6432) Cannot load parquet data with partitions if not all partition columns match data columns

2015-03-20 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6432:


 Summary: Cannot load parquet data with partitions if not all 
partition columns match data columns
 Key: SPARK-6432
 URL: https://issues.apache.org/jira/browse/SPARK-6432
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang


Suppose we have a dataset in the following folder structure:

{noformat}
parquet/source=live/date=2015-03-18/
parquet/source=live/date=2015-03-19/
...
{noformat}

And the data schema has the following columns:
- id
- *event_date*
- source
- value

The partition key source matches the data column source, but the partition key date 
doesn't match any column in the data.

Then we cannot load the dataset in Spark using parquetFile. It reports:

org.apache.spark.sql.AnalysisException: Ambiguous references to source: 
(source#2,List()),(source#5,List());
...

Currently, if the partition columns overlap with the data columns, the partition 
columns have to be a subset of the data columns.
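
A minimal reproduction sketch of how I hit this; the case class, values and local 
paths are purely illustrative, and it assumes a Spark 1.3 sc/sqlContext:

{code}
// The data files carry a `source` column; the directory layout also encodes
// `source` as a partition key, plus a `date` key that is not in the data.
case class Event(id: Long, event_date: String, source: String, value: Double)

val df = sqlContext.createDataFrame(sc.parallelize(Seq(
  Event(1L, "2015-03-18", "live", 1.0),
  Event(2L, "2015-03-19", "live", 2.0))))

df.filter("event_date = '2015-03-18'")
  .saveAsParquetFile("parquet/source=live/date=2015-03-18")
df.filter("event_date = '2015-03-19'")
  .saveAsParquetFile("parquet/source=live/date=2015-03-19")

// Loading the root then fails with the ambiguous reference to `source`:
val loaded = sqlContext.parquetFile("parquet")
{code}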

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6432) Cannot load parquet data with partitions if not all partition columns match data columns

2015-03-20 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371061#comment-14371061
 ] 

Jianshi Huang commented on SPARK-6432:
--

If no partition column appears in the data columns, then it's fine.


Jianshi

 Cannot load parquet data with partitions if not all partition columns match 
 data columns
 

 Key: SPARK-6432
 URL: https://issues.apache.org/jira/browse/SPARK-6432
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang
Assignee: Cheng Lian

 Suppose we have a dataset in the following folder structure:
 {noformat}
 parquet/source=live/date=2015-03-18/
 parquet/source=live/date=2015-03-19/
 ...
 {noformat}
 And the data schema has the following columns:
 - id
 - *event_date*
 - source
 - value
 Where partition key source matches data column source, but partition key date 
 doesn't match any columns in data.
 Then we cannot load dataset in Spark using parquetFile. It reports:
 {code}
 org.apache.spark.sql.AnalysisException: Ambiguous references to source: 
 (source#2,List()),(source#5,List());
 ...
 {code}
 Currently if partition columns has overlaps with data columns, partition 
 columns have to be a subset of the data columns.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6432) Cannot load parquet data with partitions if not all partition columns match data columns

2015-03-20 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6432:
-
Description: 
Suppose we have a dataset in the following folder structure:

{noformat}
parquet/source=live/date=2015-03-18/
parquet/source=live/date=2015-03-19/
...
{noformat}

And the data schema has the following columns:
- id
- *event_date*
- source
- value

The partition key source matches the data column source, but the partition key date 
doesn't match any column in the data.

Then we cannot load the dataset in Spark using parquetFile. It reports:

{code}
org.apache.spark.sql.AnalysisException: Ambiguous references to source: 
(source#2,List()),(source#5,List());
...
{code}

Currently, if the partition columns overlap with the data columns, the partition 
columns have to be a subset of the data columns.

Jianshi

  was:
Suppose we have a dataset in the following folder structure:

{noformat}
parquet/source=live/date=2015-03-18/
parquet/source=live/date=2015-03-19/
...
{noformat}

And the data schema has the following columns:
- id
- *event_date*
- source
- value

Where partition key source matches data column source, but partition key date 
doesn't match any columns in data.

Then we cannot load dataset in Spark using parquetFile. It reports:

org.apache.spark.sql.AnalysisException: Ambiguous references to source: 
(source#2,List()),(source#5,List());
...

Currently if partition columns has overlaps with data columns, partition 
columns have to be a subset of the data columns.

Jianshi


 Cannot load parquet data with partitions if not all partition columns match 
 data columns
 

 Key: SPARK-6432
 URL: https://issues.apache.org/jira/browse/SPARK-6432
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Suppose we have a dataset in the following folder structure:
 {noformat}
 parquet/source=live/date=2015-03-18/
 parquet/source=live/date=2015-03-19/
 ...
 {noformat}
 And the data schema has the following columns:
 - id
 - *event_date*
 - source
 - value
 Where partition key source matches data column source, but partition key date 
 doesn't match any columns in data.
 Then we cannot load dataset in Spark using parquetFile. It reports:
 {code}
 org.apache.spark.sql.AnalysisException: Ambiguous references to source: 
 (source#2,List()),(source#5,List());
 ...
 {code}
 Currently if partition columns has overlaps with data columns, partition 
 columns have to be a subset of the data columns.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope

2015-03-17 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6382:
-
Description: 
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that are built on top of DataFrame, as many operations have to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also, the UDF registry is a mutable HashMap; refactoring it to an immutable one makes 
more sense.
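
For contrast, a minimal sketch of what we have to do today with the global registry 
(assuming a running sqlContext and tables d1/d2 that expose id and map columns):

{code}
// Global registration: the UDF stays visible to every later query,
// which is exactly what this ticket would like to scope to a block.
sqlContext.udf.register("merge_map",
  (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2)

val merged = sqlContext.sql(
  "select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
{code}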

Jianshi


  was:
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that's built on top of DataFrame, as many operations has to done by 
registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 
++ m2),
...) {
  sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id)
}
{code}

Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes 
more sense.

Jianshi



 withUDF(...) {...} for supporting temporary UDF definitions in the scope
 

 Key: SPARK-6382
 URL: https://issues.apache.org/jira/browse/SPARK-6382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Currently the scope of UDF registration is global. It's unsuitable for 
 libraries that's built on top of DataFrame, as many operations has to be done 
 by registering a UDF first.
 Please provide a way for binding temporary UDFs.
 e.g.
 {code}
 withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = 
 m2 ++ m2),
 ...) {
   sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id)
 }
 {code}
 Also UDF registry is a mutable Hashmap, refactoring it to a immutable one 
 makes more sense.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope

2015-03-17 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6382:
-
Description: 
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that are built on top of DataFrame, as many operations have to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also, the UDF registry is a mutable HashMap; refactoring it to an immutable one makes 
more sense.

Jianshi


  was:
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that's built on top of DataFrame, as many operations has to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 
++ m2),
...) {
  sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id)
}
{code}

Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes 
more sense.

Jianshi



 withUDF(...) {...} for supporting temporary UDF definitions in the scope
 

 Key: SPARK-6382
 URL: https://issues.apache.org/jira/browse/SPARK-6382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Currently the scope of UDF registration is global. It's unsuitable for 
 libraries that are built on top of DataFrame, as many operations has to be 
 done by registering a UDF first.
 Please provide a way for binding temporary UDFs.
 e.g.
 {code}
 withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = 
 m2 ++ m2),
 ...) {
   sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id)
 }
 {code}
 Also UDF registry is a mutable Hashmap, refactoring it to a immutable one 
 makes more sense.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope

2015-03-17 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6382:


 Summary: withUDF(...) {...} for supporting temporary UDF 
definitions in the scope
 Key: SPARK-6382
 URL: https://issues.apache.org/jira/browse/SPARK-6382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang


Currently the scope of UDF registration is global. It's unsuitable for 
libraries that are built on top of DataFrame, as many operations have to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also, the UDF registry is a mutable HashMap; refactoring it to an immutable one makes 
more sense.

Jianshi




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6363) make scala 2.11 default language

2015-03-17 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365726#comment-14365726
 ] 

Jianshi Huang commented on SPARK-6363:
--

My two cents.

The only modules that are not working for 2.11 are thriftserver and 
kafka-streaming; Kafka only started supporting 2.11 from 0.8.2, and Spark's kafka 
module depends on 0.8.0.

It should take just a change of the Kafka version to make it ready for 2.11. AFAIK, 
the 0.8.2 client works well with earlier 0.8.x servers.

And we need to fix the thriftserver build issue.

Jianshi

 make scala 2.11 default language
 

 Key: SPARK-6363
 URL: https://issues.apache.org/jira/browse/SPARK-6363
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: antonkulaga
Priority: Minor
  Labels: scala

 Most libraries have already moved to 2.11 and many are starting to drop 2.10 
 support. So it would be better if Spark binaries were built with Scala 2.11 by 
 default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6195) Specialized in-memory column type for decimal

2015-03-17 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365549#comment-14365549
 ] 

Jianshi Huang commented on SPARK-6195:
--

Like this optimization! :)

Jianshi

 Specialized in-memory column type for decimal
 -

 Key: SPARK-6195
 URL: https://issues.apache.org/jira/browse/SPARK-6195
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.3.0, 1.2.1
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.4.0


 When building in-memory columnar representation, decimal values are currently 
 serialized via a generic serializer, which is unnecessarily slow. Since 
 decimals are actually annotated long values, we should add a specialized 
 decimal column type to speed up in-memory decimal serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356478#comment-14356478
 ] 

Jianshi Huang commented on SPARK-6277:
--

I see. Not tying this to Hadoop config is fine, but how about env variables? Quite 
often I want to change one setting for particular tasks, and editing 
spark-defaults.conf every time is inconvenient. Env variables are the best fit here 
because of their dynamic scope.

Typesafe's Config has a similar feature for referencing env variables, and it can 
even let them override previous settings.

  
https://github.com/typesafehub/config#optional-system-or-env-variable-overrides

Jianshi

 Allow Hadoop configurations and env variables to be referenced in 
 spark-defaults.conf
 -

 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I need to set spark.local.dir to use user local home instead of /tmp, but 
 currently spark-defaults.conf can only allow constant values.
 What I want to do is to write:
 bq. spark.local.dir /home/${user.name}/spark/tmp
 or
 bq. spark.local.dir /home/${USER}/spark/tmp
 Otherwise I would have to hack bin/spark-class and pass the option through 
 -Dspark.local.dir
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6277:


 Summary: Allow Hadoop configurations and env variables to be 
referenced in spark-defaults.conf
 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.2.1, 1.3.0
Reporter: Jianshi Huang


I need to set spark.local.dir to use the user's local home instead of /tmp, but 
currently spark-defaults.conf only allows constant values.

What I want to do is to write:

bq. spark.local.dir /home/${user.name}/spark/tmp
or
bq. spark.local.dir /home/${USER}/spark/tmp

Otherwise I would have to hack bin/spark-class and pass the option through 
-Dspark.local.dir
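
A workaround sketch in the meantime (my own pattern, not an official one) is to 
resolve the value in application code and set it on the SparkConf before creating 
the context:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Fall back to the JVM's user.name property if USER is not exported.
val user = sys.env.getOrElse("USER", sys.props("user.name"))
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", s"/home/$user/spark/tmp")
val sc = new SparkContext(conf)
{code}

This moves configuration into code, though, and some cluster managers may still 
override spark.local.dir.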

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6201) INSET should coerce types

2015-03-09 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248
 ] 

Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:40 PM:
--

Implicit coercion outside the numeric domain is quite evil. I don't think 
Hive's behavior makes sense here.

Raising an exception is fine in this case. And if you want to make it Hive 
compliant, then please think about adding a switch, say

bq.  spark.sql.strict_mode = true(default) / false

Jianshi


was (Author: huangjs):
Implicit coercion outside the Numeric domain is quite evil. I don't think 
Hive's behavior makes sense here. 

Raising an exception is fine in this case. And if you want to make it Hive 
compliant, then pls think about adding an switch, say

```
  spark.sql.strict_mode = true(default) / false
```

Jianshi

 INSET should coerce types
 -

 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 Suppose we have the following table:
 {code}
 sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, 
 {\a\: \3\}}))).registerTempTable(d)
 {code}
 The schema is
 {noformat}
 root
  |-- a: string (nullable = true)
 {noformat}
 Then,
 {code}
 sql(select * from d where (d.a = 1 or d.a = 2)).collect
 =
 Array([1], [2])
 {code}
 where d.a and constants 1,2 will be casted to Double first and do the 
 comparison as you can find it out in the plan:
 {noformat}
 Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
 DoubleType) = CAST(2, DoubleType)))
 {noformat}
 However, if I use
 {code}
 sql(select * from d where d.a in (1,2)).collect
 {code}
 The result is empty.
 The physical plan shows it's using INSET:
 {noformat}
 == Physical Plan ==
 Filter a#155 INSET (1,2)
  PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
 {noformat}
 *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
 where Hive does. We should make SparkSQL conform to Hive's behavior, even 
 though doing implicit coercion here is very confusing for comparing String 
 and Int.*
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6201) INSET should coerce types

2015-03-09 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248
 ] 

Jianshi Huang commented on SPARK-6201:
--

Implicit coercion outside the numeric domain is quite evil. I don't think 
Hive's behavior makes sense here.

Raising an exception is fine in this case. And if you want to make it Hive 
compliant, then please think about adding a switch, say

  spark.sql.strict_mode = true(default) / false

Jianshi

 INSET should coerce types
 -

 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 Suppose we have the following table:
 {code}
 sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, 
 {\a\: \3\}}))).registerTempTable(d)
 {code}
 The schema is
 {noformat}
 root
  |-- a: string (nullable = true)
 {noformat}
 Then,
 {code}
 sql(select * from d where (d.a = 1 or d.a = 2)).collect
 =
 Array([1], [2])
 {code}
 where d.a and constants 1,2 will be casted to Double first and do the 
 comparison as you can find it out in the plan:
 {noformat}
 Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
 DoubleType) = CAST(2, DoubleType)))
 {noformat}
 However, if I use
 {code}
 sql(select * from d where d.a in (1,2)).collect
 {code}
 The result is empty.
 The physical plan shows it's using INSET:
 {noformat}
 == Physical Plan ==
 Filter a#155 INSET (1,2)
  PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
 {noformat}
 *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
 where Hive does. We should make SparkSQL conform to Hive's behavior, even 
 though doing implicit coercion here is very confusing for comparing String 
 and Int.*
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6201) INSET should coerce types

2015-03-09 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248
 ] 

Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:39 PM:
--

Implicit coercion outside the numeric domain is quite evil. I don't think 
Hive's behavior makes sense here.

Raising an exception is fine in this case. And if you want to make it Hive 
compliant, then please think about adding a switch, say

```
  spark.sql.strict_mode = true(default) / false
```

Jianshi


was (Author: huangjs):
Implicit coercion outside the Numeric domain is quite evil. I don't think 
Hive's behavior makes sense here. 

Raising an exception is fine in this case. And if you want to make it Hive 
compliant, then pls think about adding an switch, say

  spark.sql.strict_mode = true(default) / false

Jianshi

 INSET should coerce types
 -

 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 Suppose we have the following table:
 {code}
 sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, 
 {\a\: \3\}}))).registerTempTable(d)
 {code}
 The schema is
 {noformat}
 root
  |-- a: string (nullable = true)
 {noformat}
 Then,
 {code}
 sql(select * from d where (d.a = 1 or d.a = 2)).collect
 =
 Array([1], [2])
 {code}
 where d.a and constants 1,2 will be casted to Double first and do the 
 comparison as you can find it out in the plan:
 {noformat}
 Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
 DoubleType) = CAST(2, DoubleType)))
 {noformat}
 However, if I use
 {code}
 sql(select * from d where d.a in (1,2)).collect
 {code}
 The result is empty.
 The physical plan shows it's using INSET:
 {noformat}
 == Physical Plan ==
 Filter a#155 INSET (1,2)
  PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
 {noformat}
 *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
 where Hive does. We should make SparkSQL conform to Hive's behavior, even 
 though doing implicit coercion here is very confusing for comparing String 
 and Int.*
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2

2015-03-08 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352417#comment-14352417
 ] 

Jianshi Huang commented on SPARK-6154:
--

I see. Here are my build flags:

  -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Phadoop-2.4 
-Djava.version=1.7 -DskipTests

BTW, when will Kafka and JDBC be supported in the 2.11 build?

Jianshi

 Build error with Scala 2.11 for v1.3.0-rc2
 --

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2
 5: object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1
 65: not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6155) Support latest Scala (2.11.6+)

2015-03-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6155:
-
Summary: Support latest Scala (2.11.6+)  (was: Support Scala 2.11.6+)

 Support latest Scala (2.11.6+)
 --

 Key: SPARK-6155
 URL: https://issues.apache.org/jira/browse/SPARK-6155
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Just tried to build with Scala 2.11.5. failed with following error message:
 [INFO] Compiling 9 Scala sources to 
 /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
 [ERROR] 
 /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
  value withIncompleteHandler is not a member of 
 SparkIMain.this.global.PerRunReporting
 [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) = 
 isIncomplete = true) {
 [ERROR]^
 Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5
 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+

2015-03-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6155:
-
Issue Type: New Feature  (was: Improvement)

 Support Scala 2.11.5+
 -

 Key: SPARK-6155
 URL: https://issues.apache.org/jira/browse/SPARK-6155
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang
Priority: Minor

 Just tried to build with Scala 2.11.5. failed with following error message:
 [INFO] Compiling 9 Scala sources to 
 /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
 [ERROR] 
 /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
  value withIncompleteHandler is not a member of 
 SparkIMain.this.global.PerRunReporting
 [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) = 
 isIncomplete = true) {
 [ERROR]^
 Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5
 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6155) Support Scala 2.11.5+

2015-03-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6155:
-
Priority: Major  (was: Minor)

 Support Scala 2.11.5+
 -

 Key: SPARK-6155
 URL: https://issues.apache.org/jira/browse/SPARK-6155
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Just tried to build with Scala 2.11.5. failed with following error message:
 [INFO] Compiling 9 Scala sources to 
 /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
 [ERROR] 
 /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
  value withIncompleteHandler is not a member of 
 SparkIMain.this.global.PerRunReporting
 [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) = 
 isIncomplete = true) {
 [ERROR]^
 Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5
 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6155) Support Scala 2.11.6+

2015-03-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6155:
-
Summary: Support Scala 2.11.6+  (was: Support Scala 2.11.5+)

 Support Scala 2.11.6+
 -

 Key: SPARK-6155
 URL: https://issues.apache.org/jira/browse/SPARK-6155
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Just tried to build with Scala 2.11.5. failed with following error message:
 [INFO] Compiling 9 Scala sources to 
 /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
 [ERROR] 
 /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
  value withIncompleteHandler is not a member of 
 SparkIMain.this.global.PerRunReporting
 [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) = 
 isIncomplete = true) {
 [ERROR]^
 Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5
 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6201) INSET should coerce types

2015-03-06 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6201:


 Summary: INSET should coerce types
 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.2.0, 1.3.0
Reporter: Jianshi Huang


Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", 
"{\"a\": \"3\"}"))).registerTempTable("d")
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=
Array([1], [2])
{code}

where d.a and the constants 1, 2 are first cast to Double and then compared, as you 
can see in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql("select * from d where d.a in (1,2)").collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

But it seems the INSET implementation in SparkSQL doesn't coerce types implicitly, 
whereas Hive does.
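
Until the coercion rules agree, two workaround sketches that should sidestep INSET 
(assuming the table d registered above):

{code}
// Compare against string literals so no coercion is needed:
sql("select * from d where d.a in ('1', '2')").collect

// Or cast the column explicitly before the IN:
sql("select * from d where cast(d.a as double) in (1, 2)").collect
{code}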

Jianshi




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6201) INSET should coerce types

2015-03-06 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6201:
-
Description: 
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", 
"{\"a\": \"3\"}"))).registerTempTable("d")
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=
Array([1], [2])
{code}

where d.a and the constants 1, 2 are first cast to Double and then compared, as you 
can see in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql("select * from d where d.a in (1,2)").collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

*But it seems the INSET implementation in SparkSQL doesn't coerce types implicitly, 
whereas Hive does.*

Jianshi


  was:
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: 
\3\}}))).registerTempTable(d)
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql(select * from d where (d.a = 1 or d.a = 2)).collect
=
Array([1], [2])
{code}

where d.a and constants 1,2 will be casted to Double first and do the 
comparison as you can find it out in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql(select * from d where d.a in (1,2)).collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
where Hive does.

Jianshi



 INSET should coerce types
 -

 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 Suppose we have the following table:
 {code}
 sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, 
 {\a\: \3\}}))).registerTempTable(d)
 {code}
 The schema is
 {noformat}
 root
  |-- a: string (nullable = true)
 {noformat}
 Then,
 {code}
 sql(select * from d where (d.a = 1 or d.a = 2)).collect
 =
 Array([1], [2])
 {code}
 where d.a and constants 1,2 will be casted to Double first and do the 
 comparison as you can find it out in the plan:
 {noformat}
 Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
 DoubleType) = CAST(2, DoubleType)))
 {noformat}
 However, if I use
 {code}
 sql(select * from d where d.a in (1,2)).collect
 {code}
 The result is empty.
 The physical plan shows it's using INSET:
 {noformat}
 == Physical Plan ==
 Filter a#155 INSET (1,2)
  PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
 {noformat}
 *But it seems INSET implementation in SparkSQL doesn't coerce type 
 implicitly, where Hive does.*
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6201) INSET should coerce types

2015-03-06 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6201:
-
Description: 
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", 
"{\"a\": \"3\"}"))).registerTempTable("d")
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=
Array([1], [2])
{code}

where d.a and the constants 1, 2 are first cast to Double and then compared, as you 
can see in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql("select * from d where d.a in (1,2)").collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

But it seems the INSET implementation in SparkSQL doesn't coerce types implicitly, 
whereas Hive does.

Jianshi


  was:
Suppose we the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: 
\3\}}))).registerTempTable(d)
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql(select * from d where (d.a = 1 or d.a = 2)).collect
=
Array([1], [2])
{code}

where d.a and constants 1,2 will be casted to Double first and do the 
comparison as you can find it out in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql(select * from d where d.a in (1,2)).collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
where Hive does.

Jianshi



 INSET should coerce types
 -

 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 Suppose we have the following table:
 {code}
 sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, 
 {\a\: \3\}}))).registerTempTable(d)
 {code}
 The schema is
 {noformat}
 root
  |-- a: string (nullable = true)
 {noformat}
 Then,
 {code}
 sql(select * from d where (d.a = 1 or d.a = 2)).collect
 =
 Array([1], [2])
 {code}
 where d.a and constants 1,2 will be casted to Double first and do the 
 comparison as you can find it out in the plan:
 {noformat}
 Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
 DoubleType) = CAST(2, DoubleType)))
 {noformat}
 However, if I use
 {code}
 sql(select * from d where d.a in (1,2)).collect
 {code}
 The result is empty.
 The physical plan shows it's using INSET:
 {noformat}
 == Physical Plan ==
 Filter a#155 INSET (1,2)
  PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
 {noformat}
 But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
 where Hive does.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6201) INSET should coerce types

2015-03-06 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6201:
-
Description: 
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", 
"{\"a\": \"3\"}"))).registerTempTable("d")
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=
Array([1], [2])
{code}

where d.a and the constants 1, 2 are first cast to Double and then compared, as you 
can see in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql("select * from d where d.a in (1,2)").collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}


*It seems the INSET implementation in SparkSQL doesn't coerce types implicitly, 
whereas Hive does. We should make SparkSQL conform to Hive's behavior, even 
though doing implicit coercion here is very confusing when comparing String and 
Int.*

Jianshi


  was:
Suppose we have the following table:

{code}
sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, {\a\: 
\3\}}))).registerTempTable(d)
{code}

The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}

Then,

{code}
sql(select * from d where (d.a = 1 or d.a = 2)).collect
=
Array([1], [2])
{code}

where d.a and constants 1,2 will be casted to Double first and do the 
comparison as you can find it out in the plan:

{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
DoubleType) = CAST(2, DoubleType)))
{noformat}

However, if I use

{code}
sql(select * from d where d.a in (1,2)).collect
{code}

The result is empty.

The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}

*But it seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
where Hive does.*

Jianshi



 INSET should coerce types
 -

 Key: SPARK-6201
 URL: https://issues.apache.org/jira/browse/SPARK-6201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 Suppose we have the following table:
 {code}
 sqlc.jsonRDD(sc.parallelize(Seq({\a\: \1\}}, {\a\: \2\}}, 
 {\a\: \3\}}))).registerTempTable(d)
 {code}
 The schema is
 {noformat}
 root
  |-- a: string (nullable = true)
 {noformat}
 Then,
 {code}
 sql(select * from d where (d.a = 1 or d.a = 2)).collect
 =
 Array([1], [2])
 {code}
 where d.a and constants 1,2 will be casted to Double first and do the 
 comparison as you can find it out in the plan:
 {noformat}
 Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, 
 DoubleType) = CAST(2, DoubleType)))
 {noformat}
 However, if I use
 {code}
 sql(select * from d where d.a in (1,2)).collect
 {code}
 The result is empty.
 The physical plan shows it's using INSET:
 {noformat}
 == Physical Plan ==
 Filter a#155 INSET (1,2)
  PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
 {noformat}
 *It seems INSET implementation in SparkSQL doesn't coerce type implicitly, 
 where Hive does. We should make SparkSQL conform to Hive's behavior, even 
 though doing implicit coercion here is very confusing for comparing String 
 and Int.*
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data

2015-03-05 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348361#comment-14348361
 ] 

Jianshi Huang edited comment on SPARK-5763 at 3/5/15 8:21 AM:
--

Upvote for this improvement.

There's a similar ticket for groupByKey as well. 
https://issues.apache.org/jira/browse/SPARK-3461

Jianshi


was (Author: huangjs):
Upvote for this improvement.

Jianshi

 Sort-based Groupby and Join to resolve skewed data
 --

 Key: SPARK-5763
 URL: https://issues.apache.org/jira/browse/SPARK-5763
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Lianhui Wang

 In SPARK-4644, it provide a way to resolve skewed data. But when we has more 
 keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So 
 we can use sort-merge to resolve skewed-groupby and skewed-join.because 
 SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based 
 on SPARK-2926. And i have implemented sort-merge-groupby and it is very well 
 for skewed data in my test.Later i will implement sort-merge-join to resolve 
 skewed-join.
 [~rxin] [~sandyr] [~andrewor14] how about your opinions about this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data

2015-03-05 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348361#comment-14348361
 ] 

Jianshi Huang commented on SPARK-5763:
--

Upvote for this improvement.

Jianshi

 Sort-based Groupby and Join to resolve skewed data
 --

 Key: SPARK-5763
 URL: https://issues.apache.org/jira/browse/SPARK-5763
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Lianhui Wang

 SPARK-4644 provides a way to resolve skewed data, but when more keys are 
 skewed, I think the approach in SPARK-4644 is inappropriate. So we can use 
 sort-merge to resolve skewed group-by and skewed join. Because SPARK-2926 
 implements merge-sort, we can build the sort-merge handling for skew on top of 
 SPARK-2926. I have implemented sort-merge-groupby and it works very well for 
 skewed data in my tests. Later I will implement sort-merge-join to resolve 
 skewed joins.
 [~rxin] [~sandyr] [~andrewor14] what are your opinions on this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6155) Build with Scala 2.11.5 failed for Spark v1.3.0-rc2

2015-03-04 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347969#comment-14347969
 ] 

Jianshi Huang commented on SPARK-6155:
--

Yeah, please add the feature request.

Just a question: is Scala 2.11.2 only used for the Spark runtime (shaded?)? Will 
my application built with 2.11.5 run fine?

Jianshi

 Build with Scala 2.11.5 failed for Spark v1.3.0-rc2
 ---

 Key: SPARK-6155
 URL: https://issues.apache.org/jira/browse/SPARK-6155
 Project: Spark
  Issue Type: Bug
Reporter: Jianshi Huang
Priority: Minor

 Just tried to build with Scala 2.11.5. It failed with the following error message:
 [INFO] Compiling 9 Scala sources to 
 /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
 [ERROR] 
 /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
  value withIncompleteHandler is not a member of 
 SparkIMain.this.global.PerRunReporting
 [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) => 
 isIncomplete = true) {
 [ERROR]^
 Looks like withIncompleteHandler has been moved from Reporting to Parsing in 2.11.5
 http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2

2015-03-04 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6154:
-
Component/s: SQL

 Build error with Scala 2.11 for v1.3.0-rc2
 --

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2

2015-03-03 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6154:


 Summary: Build error with Scala 2.11 for v1.3.0-rc2
 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Jianshi Huang


Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed 
when -Phive-thriftserver is enabled.

[info] Compiling 9 Scala sources to 
/home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
[error] 
/home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
[error] import jline.{ConsoleReader, History}
[error]^
[warn] Class jline.Completor not found - continuing with a stub.
[warn] Class jline.ConsoleReader not found - continuing with a stub.
[error] 
/home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
[error] val reader = new ConsoleReader()


Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5828) Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-5828:


 Summary: Dynamic partition pattern support
 Key: SPARK-5828
 URL: https://issues.apache.org/jira/browse/SPARK-5828
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jianshi Huang


Hi,

HCatalog allows you to specify the pattern of paths for partitions, which will 
be used by dynamic partition loading.

  
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables

Can we have a similar feature in SparkSQL?

Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX

2015-02-05 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308595#comment-14308595
 ] 

Jianshi Huang commented on SPARK-4279:
--

Is anyone working on this?

 Implementing TinkerPop on top of GraphX
 ---

 Key: SPARK-4279
 URL: https://issues.apache.org/jira/browse/SPARK-4279
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Brennon York
Priority: Minor

 [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph 
 databases and has been implemented across various graph database backends. 
 Has anyone thought about integrating the TinkerPop framework with GraphX to 
 enable GraphX as another backend? Not sure if this has been brought up or 
 not, but would certainly volunteer to spearhead this effort if the community 
 thinks it to be a good idea.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5446) Parquet column pruning should work for Map and Struct

2015-01-28 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-5446:


 Summary: Parquet column pruning should work for Map and Struct
 Key: SPARK-5446
 URL: https://issues.apache.org/jira/browse/SPARK-5446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang


Consider the following query:

{code:sql}
select stddev_pop(variables.var1) stddev
from model
group by model_name
{code}

Here variables is a Struct containing many fields; similarly, it could be a Map 
with many key-value pairs.

During execution, SparkSQL will shuffle the whole map or struct column instead 
of extracting the value first. The performance is very poor.

The optimized version could use a subquery:
{code:sql}
select stddev_pop(var) stddev
from (select variables.var1 as var, model_name from model) m
group by model_name
{code}

Here we extract the field/key-value on the mapper side only, so the data being 
shuffled is small.

Parquet already supports reading a single field/key-value at the storage level, 
but SparkSQL currently doesn't have an optimization for it. This would be a very 
useful optimization for tables that have a Map or Struct with many columns.
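For reference, the same manual rewrite expressed with the 1.3 DataFrame API (a 
sketch only; it assumes a HiveContext bound to sqlContext so that stddev_pop 
resolves, and model_projected is just an illustrative name):
{code}
// Project the nested field before the shuffle, then aggregate the small column.
val projected = sqlContext.table("model").selectExpr("variables.var1 as var", "model_name")
projected.registerTempTable("model_projected")
val stddevs = sqlContext.sql("select stddev_pop(var) stddev from model_projected group by model_name")
{code}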

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)

2014-12-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang closed SPARK-4781.


 Column values become all NULL after doing ALTER TABLE CHANGE for renaming 
 column names (Parquet external table in HiveContext)
 --

 Key: SPARK-4781
 URL: https://issues.apache.org/jira/browse/SPARK-4781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I have a table say created like follows:
 {code}
 CREATE EXTERNAL TABLE pmt (
   `sorted::cre_ts` string
 )
 STORED AS PARQUET
 LOCATION '...'
 {code}
 And I renamed the column from sorted::cre_ts to cre_ts by doing:
 {code}
 ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string
 {code}
 After renaming the column, the values in the column become all NULLs.
 {noformat}
 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive
 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])
 {noformat}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)

2014-12-07 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4781:
-
Issue Type: Bug  (was: Improvement)

 Column values become all NULL after doing ALTER TABLE CHANGE for renaming 
 column names (Parquet external table in HiveContext)
 --

 Key: SPARK-4781
 URL: https://issues.apache.org/jira/browse/SPARK-4781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I have a table say created like follows:
 {code}
 CREATE EXTERNAL TABLE pmt (
   `sorted::cre_ts` string
 )
 STORED AS PARQUET
 LOCATION '...'
 {code}
 And I renamed the column from sorted::cre_ts to cre_ts by doing:
 {code}
 ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string
 {code}
 After renaming the column, the values in the column become all NULLs.
 {noformat}
 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive
 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])
 {noformat}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4780) Support executing multiple statements in sql(...)

2014-12-06 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4780:


 Summary: Support executing multiple statements in sql(...)
 Key: SPARK-4780
 URL: https://issues.apache.org/jira/browse/SPARK-4780
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang
Priority: Minor


Just wondering why executing multiple statements in one sql(...) call is not 
supported. A statement cannot contain ';' either.

I had to manually turn my Hive script into a sequence of queries, which is not 
very user friendly.
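In the meantime, a minimal workaround sketch (it assumes statements are separated 
by ';' and that no statement contains an embedded ';'):
{code}
// Split a script into statements and run them one by one.
def runScript(sqlContext: org.apache.spark.sql.SQLContext, script: String): Unit =
  script.split(";").map(_.trim).filter(_.nonEmpty).foreach(stmt => sqlContext.sql(stmt))
{code}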

Please think about supporting this.

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)

2014-12-06 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4781:


 Summary: Column values become all NULL after doing ALTER TABLE 
CHANGE for renaming column names (Parquet external table in HiveContext)
 Key: SPARK-4781
 URL: https://issues.apache.org/jira/browse/SPARK-4781
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang


I have a table say created like follows:

CREATE EXTERNAL TABLE pmt {
  `sorted::cre_ts` string
}

And I renamed the column from sorted::cre_ts to cre_ts by doing:

ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string

After renaming the column, the values in the column become all NULLs.

Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
Native command: executed by Hive

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])


Jianshi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)

2014-12-06 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4781:
-
Description: 
I have a table say created like follows:

CREATE EXTERNAL TABLE pmt (
  `sorted::cre_ts` string
)
STORED AS PARQUET
LOCATION '...'

And I renamed the column from sorted::cre_ts to cre_ts by doing:

ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string

After renaming the column, the values in the column become all NULLs.

Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
Native command: executed by Hive

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])


Jianshi



  was:
I have a table say created like follows:

CREATE EXTERNAL TABLE pmt {
  `sorted::cre_ts` string
}

And I renamed the column from sorted::cre_ts to cre_ts by doing:

ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string

After renaming the column, the values in the column become all NULLs.

Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
Native command: executed by Hive

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])


Jianshi




 Column values become all NULL after doing ALTER TABLE CHANGE for renaming 
 column names (Parquet external table in HiveContext)
 --

 Key: SPARK-4781
 URL: https://issues.apache.org/jira/browse/SPARK-4781
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I have a table say created like follows:
 CREATE EXTERNAL TABLE pmt (
   `sorted::cre_ts` string
 )
 STORED AS PARQUET
 LOCATION '...'
 And I renamed the column from sorted::cre_ts to cre_ts by doing:
 ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string
 After renaming the column, the values in the column become all NULLs.
 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive
 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)

2014-12-06 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4781:
-
Description: 
I have a table say created like follows:

{code}
CREATE EXTERNAL TABLE pmt (
  `sorted::cre_ts` string
)
STORED AS PARQUET
LOCATION '...'
{code}

And I renamed the column from sorted::cre_ts to cre_ts by doing:

{code}
ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string
{code}

After renaming the column, the values in the column become all NULLs.

{noformat}
Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
Native command: executed by Hive

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])
{noformat}

Jianshi



  was:
I have a table say created like follows:

CREATE EXTERNAL TABLE pmt (
  `sorted::cre_ts` string
)
STORED AS PARQUET
LOCATION '...'

And I renamed the column from sorted::cre_ts to cre_ts by doing:

ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string

After renaming the column, the values in the column become all NULLs.

Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
Native command: executed by Hive

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])


Jianshi




 Column values become all NULL after doing ALTER TABLE CHANGE for renaming 
 column names (Parquet external table in HiveContext)
 --

 Key: SPARK-4781
 URL: https://issues.apache.org/jira/browse/SPARK-4781
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I have a table say created like follows:
 {code}
 CREATE EXTERNAL TABLE pmt (
   `sorted::cre_ts` string
 )
 STORED AS PARQUET
 LOCATION '...'
 {code}
 And I renamed the column from sorted::cre_ts to cre_ts by doing:
 {code}
 ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string
 {code}
 After renaming the column, the values in the column become all NULLs.
 {noformat}
 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive
 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])
 {noformat}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2014-12-06 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4782:


 Summary: Add inferSchema support for RDD[Map[String, Any]]
 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jianshi Huang
Priority: Minor


The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to 
be converting each Map to JSON String first and use JsonRDD.inferSchema on it.

It's very inefficient.

Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for 
Schemaless data as adding Map like interface to any serialization format is 
easy.

So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new 
serialization format we want to support, we just need to add a Map interface 
wrapper to it*
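For illustration, a minimal sketch of what the requested inference could look 
like (inferMapSchema is a hypothetical helper, the type mapping is deliberately 
simplified, and it only samples the first rows):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._

// Infer a StructType by sampling the RDD and mapping Scala runtime types to
// Catalyst types; unknown types fall back to StringType.
def inferMapSchema(rdd: RDD[Map[String, Any]], sampleSize: Int = 100): StructType = {
  def typeOf(v: Any): DataType = v match {
    case _: Int | _: Long     => LongType
    case _: Float | _: Double => DoubleType
    case _: Boolean           => BooleanType
    case _                    => StringType
  }
  val fields = rdd.take(sampleSize).flatMap(_.toSeq).groupBy(_._1).map {
    case (name, kvs) => StructField(name, typeOf(kvs.head._2), nullable = true)
  }
  StructType(fields.toSeq.sortBy(_.name))
}
{code}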

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files

2014-12-05 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4760:


 Summary: ANALYZE TABLE table COMPUTE STATISTICS noscan failed 
estimating table size for tables created from Parquet files
 Key: SPARK-4760
 URL: https://issues.apache.org/jira/browse/SPARK-4760
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang


In an older Spark version built around Oct. 12, I was able to use 

  ANALYZE TABLE table COMPUTE STATISTICS noscan

to get the estimated table size, which is important for optimizing joins (I'm 
joining 15 small dimension tables, and this is crucial for me).

In more recent Spark builds, it fails to estimate the table size unless I 
remove noscan.

Here are the statistics I got using DESC EXTENDED:

old:
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}

new:
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, 
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

And I've tried turning off spark.sql.hive.convertMetastoreParquet in my 
spark-defaults.conf and the result is unaffected (in both versions).

Looks like the Parquet support in the new Hive (0.13.1) is broken?


Jianshi




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4757) Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource

2014-12-04 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4757:


 Summary: Yarn-client failed to start due to Wrong FS error in 
distCacheMgr.addResource
 Key: SPARK-4757
 URL: https://issues.apache.org/jira/browse/SPARK-4757
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang


I got the following error during Spark startup (Yarn-client mode):

14/12/04 19:33:58 INFO Client: Uploading resource 
file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar -> 
hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
at 
org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:335)
at 
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)


According to Liancheng and Andrew, this hotfix might be the root cause:

 https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce


Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4758) Make metastore_db in-memory for HiveContext

2014-12-04 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4758:


 Summary: Make metastore_db in-memory for HiveContext
 Key: SPARK-4758
 URL: https://issues.apache.org/jira/browse/SPARK-4758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang
Priority: Minor


HiveContext by default will create a local folder metastore_db.

This is not very user friendly, as metastore_db will be locked by HiveContext 
and thus will block multiple Spark processes from starting in the same 
directory.

I would propose adding a default hive-site.xml in conf/ with the following 
content.

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:databaseName=metastore_db;create=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file://${user.dir}/hive/warehouse</value>
  </property>

</configuration>

jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
embedded derby database is created in-memory.

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4758) Make metastore_db in-memory for HiveContext

2014-12-04 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4758:
-
Description: 
HiveContext by default will create a local folder metastore_db.

This is not very user friendly, as metastore_db will be locked by HiveContext 
and thus will block multiple Spark processes from starting in the same 
directory.

I would propose adding a default hive-site.xml in conf/ with the following 
content.

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory;databaseName=metastore_db;create=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file://${user.dir}/hive/warehouse</value>
  </property>

</configuration>

jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
embedded derby database is created in-memory.

Jianshi

  was:
HiveContext by default will create a local folder metastore_db.

This is not very user friendly, as metastore_db will be locked by HiveContext 
and thus will block multiple Spark processes from starting in the same 
directory.

I would propose adding a default hive-site.xml in conf/ with the following 
content.

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:databaseName=metastore_db;create=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file://${user.dir}/hive/warehouse</value>
  </property>

</configuration>

jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
embedded derby database is created in-memory.

Jianshi


 Make metastore_db in-memory for HiveContext
 ---

 Key: SPARK-4758
 URL: https://issues.apache.org/jira/browse/SPARK-4758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang
Priority: Minor

 HiveContext by default will create a local folder metastore_db.
 This is not very user friendly, as metastore_db will be locked by HiveContext 
 and thus will block multiple Spark processes from starting in the same 
 directory.
 I would propose adding a default hive-site.xml in conf/ with the following 
 content.
 <configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:derby:memory;databaseName=metastore_db;create=true</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
   </property>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>file://${user.dir}/hive/warehouse</value>
   </property>
 </configuration>
 jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
 embedded derby database is created in-memory.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4758) Make metastore_db in-memory for HiveContext

2014-12-04 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4758:
-
Description: 
HiveContext by default will create a local folder metastore_db.

This is not very user friendly, as metastore_db will be locked by HiveContext 
and thus will block multiple Spark processes from starting in the same 
directory.

I would propose adding a default hive-site.xml in conf/ with the following 
content.

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:databaseName=metastore_db;create=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file://${user.dir}/hive/warehouse</value>
  </property>

</configuration>

jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
embedded derby database is created in-memory.

Jianshi

  was:
HiveContext by default will create a local folder metastore_db.

This is not very user friendly, as metastore_db will be locked by HiveContext 
and thus will block multiple Spark processes from starting in the same 
directory.

I would propose adding a default hive-site.xml in conf/ with the following 
content.

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory;databaseName=metastore_db;create=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file://${user.dir}/hive/warehouse</value>
  </property>

</configuration>

jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
embedded derby database is created in-memory.

Jianshi


 Make metastore_db in-memory for HiveContext
 ---

 Key: SPARK-4758
 URL: https://issues.apache.org/jira/browse/SPARK-4758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang
Priority: Minor

 HiveContext by default will create a local folder metastore_db.
 This is not very user friendly, as metastore_db will be locked by HiveContext 
 and thus will block multiple Spark processes from starting in the same 
 directory.
 I would propose adding a default hive-site.xml in conf/ with the following 
 content.
 <configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:derby:memory:databaseName=metastore_db;create=true</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
   </property>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>file://${user.dir}/hive/warehouse</value>
   </property>
 </configuration>
 jdbc:derby:memory:databaseName=metastore_db;create=true Will make sure the 
 embedded derby database is created in-memory.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL

2014-11-21 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4549:


 Summary: Support BigInt -> Decimal in convertToCatalyst in SparkSQL
 Key: SPARK-4549
 URL: https://issues.apache.org/jira/browse/SPARK-4549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jianshi Huang
Priority: Minor


Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt 
to Decimal.
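For illustration, one possible conversion, assuming the 1.3 package layout where 
Decimal lives in org.apache.spark.sql.types:
{code}
import org.apache.spark.sql.types.Decimal

// A scale-0 BigDecimal wraps the BigInt exactly, so this can reuse the existing
// BigDecimal -> Decimal path.
def bigIntToDecimal(b: scala.math.BigInt): Decimal = Decimal(scala.math.BigDecimal(b))
{code}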

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL

2014-11-21 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4549:
-
Issue Type: Improvement  (was: Bug)

 Support BigInt -> Decimal in convertToCatalyst in SparkSQL
 --

 Key: SPARK-4549
 URL: https://issues.apache.org/jira/browse/SPARK-4549
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jianshi Huang
Priority: Minor

 Since BigDecimal is just a wrapper around BigInt, let's also convert 
 BigInt to Decimal.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4551) Allow auto-conversion of field names of case class from camelCase to lower_case convention

2014-11-21 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4551:


 Summary: Allow auto-conversion of field names of case class from 
camelCase to lower_case convention
 Key: SPARK-4551
 URL: https://issues.apache.org/jira/browse/SPARK-4551
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jianshi Huang
Priority: Minor


In SQL, lower_case is the naming convention for column names. 

We want to keep this convention as much as possible. So when converting an 
RDD[case class] to SchemaRDD, let's make it an option to convert the field 
names from camelCase to lower_case (e.g. unitAmount => unit_amount).

My proposal:
1) Utility function for case conversion (e.g. 
https://gist.github.com/sidharthkuruvila/3154845)

2) Add an option convertToLowerCase to registerTempTable

  def registerTempTable(tableName: String, convertToLowerCase: Boolean = false)
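A self-contained sketch of the conversion utility from 1) above (the name 
camelToLowerCase is only illustrative):
{code}
// "unitAmount" => "unit_amount", "HTTPRequestId" => "http_request_id"
def camelToLowerCase(name: String): String =
  name.replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
      .replaceAll("([a-z\\d])([A-Z])", "$1_$2")
      .toLowerCase
{code}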


Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext

2014-11-03 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang closed SPARK-4199.

Resolution: Invalid

 Drop table if exists raises table not found exception in HiveContext
 --

 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang

 Try this:
   sql("DROP TABLE IF EXISTS some_table")
 The exception looks like this:
 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
 some_table
 14/11/02 19:55:29 INFO ParseDriver: Parse Completed
 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 
 end=1414986929678 duration=0
 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze
 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: 
 Table not found some_table
 org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
 at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
 at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
 at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext

2014-11-03 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195045#comment-14195045
 ] 

Jianshi Huang commented on SPARK-4199:
--

Turned out it was caused by a wrong version of the datanucleus jars in my Spark 
build directory. Somehow I had two versions of datanucleus...

After removing the wrong version, everything works now.

Thanks!

Jianshi

 Drop table if exists raises table not found exception in HiveContext
 --

 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang

 Try this:
   sql("DROP TABLE IF EXISTS some_table")
 The exception looks like this:
 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
 some_table
 14/11/02 19:55:29 INFO ParseDriver: Parse Completed
 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 
 end=1414986929678 duration=0
 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze
 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: 
 Table not found some_table
 org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
 at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
 at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
 at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext

2014-11-02 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-4199:
-
Summary: Drop table if exists raises table not found exception in 
HiveContext  (was: Drop table if exists raises table not found exception )

 Drop table if exists raises table not found exception in HiveContext
 --

 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang

 Try this:
   sql("DROP TABLE IF EXISTS some_table")
 The exception looks like this:
 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
 some_table
 14/11/02 19:55:29 INFO ParseDriver: Parse Completed
 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 
 end=1414986929678 duration=0
 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze
 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: 
 Table not found some_table
 org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
 at 
 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
 at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at 
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
 at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
 at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
 at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
 at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4199) Drop table if exists raises table not found exception

2014-11-02 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-4199:


 Summary: Drop table if exists raises table not found exception 
 Key: SPARK-4199
 URL: https://issues.apache.org/jira/browse/SPARK-4199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang


Try this:

  sql("DROP TABLE IF EXISTS some_table")

The exception looks like this:

14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS 
some_table
14/11/02 19:55:29 INFO ParseDriver: Parse Completed
14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 
end=1414986929678 duration=0
14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze
14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called
14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table 
not found some_table
org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table
at 
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294)
at 
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281)
at 
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824)
at 
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
at 
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
at 
org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3923) All Standalone Mode services time out with each other

2014-10-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169901#comment-14169901
 ] 

Jianshi Huang commented on SPARK-3923:
--

I have a similar problem in YARN-client mode. Setting 
spark.akka.heartbeat.interval to 100 fixes the problem.
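For reference, the equivalent programmatic setting, as a sketch (the value is 
interpreted in seconds):
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.akka.heartbeat.interval", "100")
{code}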

This is a critical bug.

Jianshi


 All Standalone Mode services time out with each other
 -

 Key: SPARK-3923
 URL: https://issues.apache.org/jira/browse/SPARK-3923
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Blocker

 I'm seeing an issue where it seems that components in Standalone Mode 
 (Worker, Master, Driver, and Executor) all seem to time out with each other 
 after around 1000 seconds. Here is an example log:
 {code}
 14/10/13 06:43:55 INFO Master: Registering worker 
 ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
 14/10/13 06:43:55 INFO Master: Registering worker 
 ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID 
 app-20141013064356-
 ... precisely 1000 seconds later ...
 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
 system 
 [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has 
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].
 14/10/13 07:00:35 INFO Master: 
 akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got 
 disassociated, removing it.
 14/10/13 07:00:35 INFO LocalActorRef: Message 
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245]
  was not delivered. [2] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:35 INFO Master: 
 akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
 disassociated, removing it.
 14/10/13 07:00:35 INFO Master: Removing worker 
 worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on 
 ip-10-0-175-214.us-west-2.compute.internal:42918
 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
 14/10/13 07:00:35 INFO Master: 
 akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
 disassociated, removing it.
 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
 system 
 [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has 
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].
 14/10/13 07:00:35 INFO LocalActorRef: Message 
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
  was not delivered. [3] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:35 INFO LocalActorRef: Message 
 [akka.remote.transport.AssociationHandle$Disassociated] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
  was not delivered. [4] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake 
 timed out or transport failure detector triggered.
 14/10/13 07:00:36 INFO Master: 
 akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got 
 disassociated, removing it.
 14/10/13 07:00:36 INFO LocalActorRef: Message 
 [akka.remote.transport.AssociationHandle$InboundPayload] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
  was not delivered. [5] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-
 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with 

[jira] [Comment Edited] (SPARK-3923) All Standalone Mode services time out with each other

2014-10-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169901#comment-14169901
 ] 

Jianshi Huang edited comment on SPARK-3923 at 10/13/14 8:45 PM:


I have a similar problem in YARN-client mode. Setting 
spark.akka.heartbeat.interval to 100 fixes the problem.

This is a critical bug.

P.S.
One thing that made me very confused during debugging is the error message. The 
important one

  WARN ReliableDeliverySupervisor: Association with remote system 
[akka.tcp://sparkDriver@xxx:50278] has failed, address is now gated for [5000] 
ms. Reason is: [Disassociated].

is of Log Level WARN.


Jianshi



was (Author: huangjs):
I have a similar problem in YARN-client mode. Setting 
spark.akka.heartbeat.interval to 100 fixes the problem.

This is a critical bug.

Jianshi


 All Standalone Mode services time out with each other
 -

 Key: SPARK-3923
 URL: https://issues.apache.org/jira/browse/SPARK-3923
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Blocker

 I'm seeing an issue where components in Standalone Mode (Worker, Master, 
 Driver, and Executor) all seem to time out with each other after around 1000 
 seconds. Here is an example log:
 {code}
 14/10/13 06:43:55 INFO Master: Registering worker 
 ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
 14/10/13 06:43:55 INFO Master: Registering worker 
 ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID 
 app-20141013064356-
 ... precisely 1000 seconds later ...
 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
 system 
 [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has 
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].
 14/10/13 07:00:35 INFO Master: 
 akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got 
 disassociated, removing it.
 14/10/13 07:00:35 INFO LocalActorRef: Message 
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245]
  was not delivered. [2] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:35 INFO Master: 
 akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
 disassociated, removing it.
 14/10/13 07:00:35 INFO Master: Removing worker 
 worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on 
 ip-10-0-175-214.us-west-2.compute.internal:42918
 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
 14/10/13 07:00:35 INFO Master: 
 akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
 disassociated, removing it.
 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
 system 
 [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has 
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].
 14/10/13 07:00:35 INFO LocalActorRef: Message 
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
  was not delivered. [3] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:35 INFO LocalActorRef: Message 
 [akka.remote.transport.AssociationHandle$Disassociated] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
  was not delivered. [4] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake 
 timed out or transport failure detector triggered.
 14/10/13 07:00:36 INFO Master: 
 akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got 
 disassociated, removing it.
 14/10/13 07:00:36 INFO LocalActorRef: Message 
 [akka.remote.transport.AssociationHandle$InboundPayload] from 

[jira] [Created] (SPARK-3906) Support joins of multiple tables in SparkSQL (SQLContext, not HiveQL)

2014-10-11 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-3906:


 Summary: Support joins of multiple tables in SparkSQL (SQLContext, 
not HiveQL)
 Key: SPARK-3906
 URL: https://issues.apache.org/jira/browse/SPARK-3906
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
Reporter: Jianshi Huang


Queries like:

select *
from fp
inner join dts
on fp.status_code = dts.status_code
inner join dc
on fp.curr_code = dc.curr_code

are currently not supported in SparkSQL. The workaround is to use a subquery.

However, it's quite cumbersome. Please support it. :)

Jianshi
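
For illustration, a minimal sketch of the nested-subquery workaround mentioned 
above. It assumes sc is an existing SparkContext, that fp, dts and dc are 
registered temp tables, and that the parser accepts subqueries in FROM, as the 
workaround suggests; column names other than the join keys status_code and 
curr_code are made up.

{code}
import org.apache.spark.sql.SQLContext

// Sketch only: join two tables first, then join the aliased result with the
// third table. status_name and curr_name are hypothetical columns.
val sqlContext = new SQLContext(sc)

val joined = sqlContext.sql("""
  SELECT t.curr_code, t.status_name, dc.curr_name
  FROM (
    SELECT fp.curr_code AS curr_code, dts.status_name AS status_name
    FROM fp INNER JOIN dts ON fp.status_code = dts.status_code
  ) t
  INNER JOIN dc ON t.curr_code = dc.curr_code
""")
{code}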



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext

2014-10-10 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang resolved SPARK-3845.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 SQLContext(...) should inherit configurations from SparkContext
 ---

 Key: SPARK-3845
 URL: https://issues.apache.org/jira/browse/SPARK-3845
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang
 Fix For: 1.2.0


 It's very confusing that Spark configurations (e.g. spark.serializer, 
 spark.speculation, etc.) can be set in the spark-defaults.conf file, while 
 SparkSQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, 
 spark.sql.codegen, etc.) have to be set either in sqlContext.setConf or 
 sql("SET ...").
 When I do:
   val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
 I would expect sqlContext to recognize all the SQL configurations that come 
 with sparkContext.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext

2014-10-09 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164882#comment-14164882
 ] 

Jianshi Huang commented on SPARK-3845:
--

Looks like it's fixed in the latest 1.2.0 snapshot.

In 1.1.0, sqlContext.getAllConfs returns an empty map.

Jianshi
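
A minimal sketch of the check implied here, assuming sc is an existing 
SparkContext whose SparkConf (e.g. via spark-defaults.conf) contained 
spark.sql.* entries at startup:

{code}
import org.apache.spark.sql.SQLContext

// Per the comment above, on 1.1.0 this prints an empty Map even when
// spark.sql.* keys were set in spark-defaults.conf; on a 1.2.0 snapshot the
// inherited entries show up.
val sqlContext = new SQLContext(sc)
println(sqlContext.getAllConfs)
{code}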

 SQLContext(...) should inherit configurations from SparkContext
 ---

 Key: SPARK-3845
 URL: https://issues.apache.org/jira/browse/SPARK-3845
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 It's very confusing that Spark configurations (e.g. spark.serializer, 
 spark.speculation, etc.) can be set in the spark-defaults.conf file, while 
 SparkSQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, 
 spark.sql.codegen, etc.) have to be set either in sqlContext.setConf or 
 sql("SET ...").
 When I do:
   val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
 I would expect sqlContext to recognize all the SQL configurations that come 
 with sparkContext.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext

2014-10-08 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-3845:


 Summary: SQLContext(...) should inherit configurations from 
SparkContext
 Key: SPARK-3845
 URL: https://issues.apache.org/jira/browse/SPARK-3845
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang


It's very confusing that Spark configurations (e.g. spark.serializer, 
spark.speculation, etc.) can be set in the spark-defaults.conf file, while 
SparkSQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, 
spark.sql.codegen, etc.) have to be set either in sqlContext.setConf or 
sql("SET ...").

When I do:

  val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)

I would expect sqlContext to recognize all the SQL configurations that come 
with sparkContext.

Jianshi
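
For context, a minimal sketch of the two per-SQLContext mechanisms mentioned 
above (sc is assumed to be an existing SparkContext; the values are only 
illustrative):

{code}
import org.apache.spark.sql.SQLContext

// SQL options currently have to be set on the SQLContext itself, either
// through setConf or through a SQL SET statement.
val sqlContext = new SQLContext(sc)

sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.sql("SET spark.sql.codegen=true")
{code}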



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3846) KryoException when doing joins in SparkSQL

2014-10-08 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-3846:


 Summary: KryoException when doing joins in SparkSQL 
 Key: SPARK-3846
 URL: https://issues.apache.org/jira/browse/SPARK-3846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang


I built the latest Spark (1.2.0-SNAPSHOT) from the master branch and found that 
jobs which succeeded on 1.1.0 now fail.

The error is reproducible when I join two tables manually. The error message is 
as follows.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in 
stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 
(TID 3802, lvshdc5dn0215.lvs.paypal.com): 
com.esotericsoftware.kryo.KryoException:
Unable to find class: 
__wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1

com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)

com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)

org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)
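
For reference, a schematic sketch of the kind of manual two-table join 
described above. The case classes, table and column names are hypothetical, 
and whether the exception actually appears also depends on the job's 
spark.serializer (Kryo) and code-generation settings, not on this code alone.

{code}
import org.apache.spark.sql.SQLContext

// Schematic repro shape only: register two small temp tables and join them.
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class Payment(id: Long, statusCode: Int, amount: Double)
case class Status(statusCode: Int, description: String)

sc.parallelize(Seq(Payment(1L, 0, 9.99))).registerTempTable("payments")
sc.parallelize(Seq(Status(0, "ok"))).registerTempTable("statuses")

sqlContext.sql(
  "SELECT p.id, s.description FROM payments p INNER JOIN statuses s " +
  "ON p.statusCode = s.statusCode"
).collect().foreach(println)
{code}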



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3846) KryoException when doing joins in SparkSQL

2014-10-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-3846:
-
  Description: 
The error is reproducible when I join two tables manually. The error message is 
as follows.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in 
stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 
(TID 3802, lvshdc5dn0215.lvs.paypal.com): 
com.esotericsoftware.kryo.KryoException:
Unable to find class: 
__wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1

com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)

com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)

org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)

  was:
I built the latest Spark (1.2.0-SNAPSHOT) from the master branch and found that 
jobs which succeeded on 1.1.0 now fail.

The error is reproducible when I join two tables manually. The error message is 
as follows.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in 
stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 
(TID 3802, lvshdc5dn0215.lvs.paypal.com): 
com.esotericsoftware.kryo.KryoException:
Unable to find class: 
__wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1

com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)

com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)


[jira] [Closed] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-10-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang closed SPARK-2890.


 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang
Assignee: Michael Armbrust
 Fix For: 1.2.0


 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it's caused by duplicated columns in my 
 select clause.
 I made the duplication on purpose for my code to parse correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3846) KryoException when doing joins in SparkSQL

2014-10-08 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-3846:
-
Description: 
The error is reproducible when I join two tables manually. The error message is 
as follows.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in 
stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 
(TID 3802, ...): com.esotericsoftware.kryo.KryoException:
Unable to find class: 
__wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1

com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)

com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)

org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)

  was:
The error is reproducible when I join two tables manually. The error message is 
as follows.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 in 
stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 3.0 
(TID 3802, lvshdc5dn0215.lvs.paypal.com): 
com.esotericsoftware.kryo.KryoException:
Unable to find class: 
__wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1

com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)

com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-13 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095224#comment-14095224
 ] 

Jianshi Huang commented on SPARK-2890:
--

I think the fault is on my side. I should have projected the duplicated 
columns into different names.

So the current behavior makes sense. I'll close this issue.

Jianshi
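
A minimal sketch of the aliasing approach described above; the table and 
column names are entirely made up, and the point is only that every output 
field ends up with a unique name:

{code}
import org.apache.spark.sql.SQLContext

// Project the repeated columns under distinct aliases so the resulting
// schema has unique field names. "vertices", "vertex_id" and "label" are
// hypothetical.
val sqlContext = new SQLContext(sc)

val result = sqlContext.sql(
  "SELECT vertex_id, label, vertex_id AS prop_vertex_id, label AS prop_label " +
  "FROM vertices")
{code}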

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it's caused by duplicated columns in my 
 select clause.
 I made the duplication on purpose for my code to parse correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-13 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang closed SPARK-2890.


Resolution: Invalid

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it's caused by duplicated columns in my 
 select clause.
 I made the duplication on purpose for my code to parse correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-11 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093746#comment-14093746
 ] 

Jianshi Huang commented on SPARK-2890:
--

My use case:

The result will be parsed into (id, type, start, end, properties) tuples. 
Properties might or might not contain any of (id, type, start, end). So it's 
easier just to list them at the end and not to worry about duplicated names.

Jianshi

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it's caused by duplicated columns in my 
 select clause.
 I made the duplication on purpose for my code to parse correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-07 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088939#comment-14088939
 ] 

Jianshi Huang commented on SPARK-2890:
--

In previous versions, there were no warnings, but the result was buggy and 
contained nulls in the duplicated columns.

Jianshi


 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it's caused by duplicated columns in my 
 select clause.
 I made the duplication on purpose for my code to parse correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-06 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-2890:


 Summary: Spark SQL should allow SELECT with duplicated columns
 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang


Spark reported a java.lang.IllegalArgumentException with the message:

java.lang.IllegalArgumentException: requirement failed: Found fields with the 
same name.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
at 
org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
at 
org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
at 
org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)

After trial and error, it seems it's caused by duplicated columns in my select 
clause.

I made the duplication on purpose for my code to parse correctly. I think we 
should allow users to specify duplicated columns as return values.

Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080729#comment-14080729
 ] 

Jianshi Huang commented on SPARK-2728:
--

I see. Thanks for the fix, Sean and Larry!

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the number of partitions is greater than 10362, then Spark will report an 
 ArrayIndexOutOfBoundsException. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i <- 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Would casting rddSample.length - 1 to Long be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080741#comment-14080741
 ] 

Jianshi Huang commented on SPARK-2728:
--

Can anyone test it? Then I'll close the issue. My build for HDP 2.1 couldn't work 
with YARN... strange.

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the number of partitions is greater than 10362, then Spark will report an 
 ArrayIndexOutOfBoundsException. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i <- 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Would casting rddSample.length - 1 to Long be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-29 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-2728:


 Summary: Integer overflow in partition index calculation 
RangePartitioner
 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang


If the number of partitions is greater than 10362, then Spark will report an 
ArrayIndexOutOfBoundsException. 

The reason is in the partition index calculation in rangeBounds:

#Line: 112
val bounds = new Array[K](partitions - 1)
for (i <- 0 until partitions - 1) {
  val index = (rddSample.length - 1) * (i + 1) / partitions
  bounds(i) = rddSample(index)
}

Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.

Would casting rddSample.length - 1 to Long be enough for a fix?

Jianshi
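
To make the arithmetic concrete, a small self-contained sketch of the overflow 
and of the suggested Long cast; the sample length and partition count below 
are illustrative, not the actual values used by RangePartitioner.

{code}
// With a large sample and enough partitions the Int product wraps to a
// negative value, while the Long version stays a valid index.
val sampleLength = 1000000          // stands in for rddSample.length
val partitions   = 20000            // well above the ~10362 threshold reported

val i = partitions - 2              // last iteration of the loop above
val overflowed = (sampleLength - 1) * (i + 1) / partitions                 // negative Int
val fixed      = ((sampleLength - 1).toLong * (i + 1) / partitions).toInt  // valid index

println(s"Int arithmetic:  $overflowed")
println(s"Long arithmetic: $fixed")
{code}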



--
This message was sent by Atlassian JIRA
(v6.2#6252)