Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
I would use just textFile unless you actually need a guarantee that you
will be seeing a whole file at a time (textFile splits on newlines).

RDDs are immutable, so you cannot add data to them.  You can however union
two RDDs, returning a new RDD that contains all the data.
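
A minimal sketch of both points in Scala (the PySpark method names are the
same); the paths below are illustrative:

// One RDD for a whole day: textFile accepts Hadoop glob patterns, so a
// wildcard over the hourly directories reads them all at once.
val day = sc.textFile("hdfs:///data/2014/10/01/*/*")

// Building the same thing incrementally: union returns a new RDD containing
// the data of both inputs, leaving the originals untouched.
val hour00 = sc.textFile("hdfs:///data/2014/10/01/00/*")
val hour01 = sc.textFile("hdfs:///data/2014/10/01/01/*")
val firstTwoHours = hour00.union(hour01)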

On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint  wrote:

> Michael,
> Thanks for your help.   I found wholeTextFiles(), which I can use to
> import all files in a directory.  I believe that would work if all
> the files existed in the same directory.  Currently the files come in by
> the hour in a layout somewhat like this: ../2014/10/01/00/filename,
> with one file per hour.
>
> Do I create an RDD and add to it? Is that possible?  My example query
> would be select count(*) from (entire day RDD) where a=2.  "a" would exist
> in all files multiple times with multiple values.
>
> I don't see in any documentation how to import a file, create an RDD, then
> import another file into that RDD, kind of like in MySQL when you create a
> table, import data, then import more data.  This may be my ignorance because
> I am not that familiar with Spark, but I would expect to import data into a
> single RDD before performing analytics on it.
>
> Thank you for your time and help on this.
>
>
> P.S. I am using python if that makes a difference.
>
> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust 
> wrote:
>
>> In general you should be able to read full directories of files as a
>> single RDD/SchemaRDD.  For documentation I'd suggest the programming
>> guides:
>>
>> http://spark.apache.org/docs/latest/quick-start.html
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> For Avro in particular, I have been working on a library for Spark SQL.
>> It's very early code, but you can find it here:
>> https://github.com/databricks/spark-avro
>>
>> Bug reports welcome!
>>
>> Michael
>>
>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint 
>> wrote:
>>
>>> Hi,
>>>
>>> I am new to Spark.  I have begun reading to understand Spark's RDDs
>>> as well as SparkSQL.  My question is more on how to build out the RDDs
>>> and best practices.   I have data that is broken down by hour into
>>> files on HDFS in Avro format.   Do I need to create a separate RDD for each
>>> file, or, using SparkSQL, a separate SchemaRDD?
>>>
>>> I want to be able to pull, let's say, an entire day of data into Spark and
>>> run some analytics on it.  Then possibly a week, a month, etc.
>>>
>>>
>>> If there is documentation on this procedure or best practices for
>>> building RDDs, please point me at them.
>>>
>>> Thanks for your time,
>>>Sam
>>>
>>>
>>>
>>>
>>
>
>
> --
>
> MAGNE+IC
>
> Sam Flint | Lead Developer, Data Analytics
>
>
>


Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
In general you should be able to read full directories of files as a single
RDD/SchemaRDD.  For documentation I'd suggest the programming guides:

http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/sql-programming-guide.html

For Avro in particular, I have been working on a library for Spark SQL.
It's very early code, but you can find it here:
https://github.com/databricks/spark-avro
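
As a rough sketch of reading a whole directory tree of Avro files as one
SchemaRDD with it (the avroFile call follows the library's early README, so
treat the exact API as an assumption; the path is illustrative):

import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._   // assumption: adds avroFile to SQLContext

val sqlContext = new SQLContext(sc)
// A glob over the hourly directories loads the whole day in one call.
val day = sqlContext.avroFile("hdfs:///data/2014/10/01/*/*.avro")
day.registerTempTable("day_20141001")
sqlContext.sql("SELECT COUNT(*) FROM day_20141001").collect()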

Bug reports welcome!

Michael

On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint  wrote:

> Hi,
>
> I am new to Spark.  I have begun reading to understand Spark's RDDs
> as well as SparkSQL.  My question is more on how to build out the RDDs
> and best practices.   I have data that is broken down by hour into
> files on HDFS in Avro format.   Do I need to create a separate RDD for each
> file, or, using SparkSQL, a separate SchemaRDD?
>
> I want to be able to pull, let's say, an entire day of data into Spark and
> run some analytics on it.  Then possibly a week, a month, etc.
>
>
> If there is documentation on this procedure or best practices for building
> RDDs, please point me at them.
>
> Thanks for your time,
>Sam
>
>
>
>


NEW to spark and sparksql

2014-11-19 Thread Sam Flint
Hi,

I am new to Spark.  I have begun reading to understand Spark's RDDs
as well as SparkSQL.  My question is more on how to build out the RDDs
and best practices.   I have data that is broken down by hour into files on
HDFS in Avro format.   Do I need to create a separate RDD for each file, or,
using SparkSQL, a separate SchemaRDD?

I want to be able to pull, let's say, an entire day of data into Spark and run
some analytics on it.  Then possibly a week, a month, etc.


If there is documentation on this procedure or best practices for building
RDDs, please point me at them.

Thanks for your time,
   Sam


Re: SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-19 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf  wrote:
>
> Is there a better way to mock this out and test Hive/metastore with
> SparkSQL?
>

I would use TestHive which creates a fresh metastore each time it is
invoked.
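
A minimal sketch of that approach, assuming the spark-hive test artifact
(which contains TestHive) is on the test classpath:

import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.FunSuite

class MetastoreSuite extends FunSuite {
  test("create and query a table against a throwaway metastore") {
    TestHive.reset()  // start each test from a clean metastore/warehouse
    TestHive.sql("CREATE TABLE test_kv (key INT, value STRING)")
    assert(TestHive.sql("SHOW TABLES").collect().nonEmpty)
  }
}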


SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-18 Thread Night Wolf
Hi,

Just to give some context: we are using the Hive metastore with CSV & Parquet
files as a part of our ETL pipeline. We query these with SparkSQL to do
some downstream work.

I'm curious what's the best way to go about testing Hive & SparkSQL? I'm
using 1.1.0.

I see that the LocalHiveContext has been deprecated.
https://issues.apache.org/jira/browse/SPARK-2397

My testing strategy is that, as part of my Before block, I basically create
the HiveContext, then create the databases/tables and map them to some test
sample data files in my test resources directory.

The LocalHiveContext was useful because I could inject it as part of the
test setup and it would take care of creating the metastore and warehouse
directories for Hive for me (local to my project). If I just create a Hive
context it does create the metastore_db folder locally, but the warehouse
directory is not created! Thus running a command like hc.sql("CREATE
DATABASE myDb") results in a Hive error. I also can't supply a test
hive-site.xml because it won't allow relative paths, which means that there
is some shared directory that everyone needs to have. The only other option
is to call the setConf method like LocalHiveContext does.
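
For reference, a minimal sketch of that setConf workaround (directory names
are illustrative, and an existing SparkContext named sc is assumed), pointing
the metastore and warehouse at throwaway project-local paths much like
LocalHiveContext did:

import java.io.File
import org.apache.spark.sql.hive.HiveContext

val warehouse = new File("target/hive-warehouse")
warehouse.mkdirs()
val metastore = new File("target/hive-metastore")  // Derby creates this itself

val hc = new HiveContext(sc)
hc.setConf("hive.metastore.warehouse.dir", warehouse.getCanonicalPath)
hc.setConf("javax.jdo.option.ConnectionURL",
  s"jdbc:derby:;databaseName=${metastore.getCanonicalPath};create=true")

hc.sql("CREATE DATABASE myDb")  // now resolves against the local warehouse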

Since LocalHiveContext is on the way out, I'm wondering if I'm doing
something stupid.

Is there a better way to mock this out and test Hive/metastore with
SparkSQL?

Cheers,
~N


Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Eric Zhen
Okay, thank you Michael.

On Wed, Nov 19, 2014 at 3:45 AM, Michael Armbrust 
wrote:

> Those are probably related.  It looks like we are somehow not being thread
> safe when initializing various parts of the scala compiler.
>
> Since code gen is pretty experimental we probably won't have the resources
> to investigate backporting a fix.  However, if you can reproduce the
> problem in Spark 1.2 then please file a JIRA.
>
> On Mon, Nov 17, 2014 at 9:37 PM, Eric Zhen  wrote:
>
>> Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100
>> (65 failed)), and sometimes causes the stage to fail.
>>
>> And there is another error that I'm not sure is related.
>>
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$
>> at
>> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114)
>> at
>> org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55)
>> at
>> org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55)
>> at
>> org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58)
>> at
>> org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>>
>> On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Interesting, I believe we have run that query with version 1.1.0 with
>>> codegen turned on and not much has changed there.  Is the error
>>> deterministic?
>>>
>>> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen  wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
>>>>
>>>> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> What version of Spark SQL?
>>>>>
>>>>> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and
>>>>>> we got the exceptions below; has anyone else seen these before?
>>>>>>
>>>>>> java.lang.ExceptionInInitializerError
>>>>>> at
>>>>>> org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
>>>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>>>>> at
>>>>>> org.apache.spark.rdd.MapPartitionsRDD.

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Michael Armbrust
Those are probably related.  It looks like we are somehow not being thread
safe when initializing various parts of the scala compiler.

Since code gen is pretty experimental we probably won't have the resources
to investigate backporting a fix.  However, if you can reproduce the
problem in Spark 1.2 then please file a JIRA.

On Mon, Nov 17, 2014 at 9:37 PM, Eric Zhen  wrote:

> Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100
> (65 failed)), and sometimes causes the stage to fail.
>
> And there is another error that I'm not sure is related.
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$
> at
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114)
> at
> org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55)
> at
> org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55)
> at
> org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58)
> at
> org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust wrote:
>
>> Interesting, I believe we have run that query with version 1.1.0 with
>> codegen turned on and not much has changed there.  Is the error
>> deterministic?
>>
>> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen  wrote:
>>
>>> Hi Michael,
>>>
>>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
>>>
>>> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> What version of Spark SQL?
>>>>
>>>> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and
>>>>> we got the exceptions below; has anyone else seen these before?
>>>>>
>>>>> java.lang.ExceptionInInitializerError
>>>>> at
>>>>> org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
>>>>> at
>>>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
>>>>> at
>>>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
>>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>>>> at
>>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>> at
>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>> at
>>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>>> at
>>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100
(65 failed)), and sometimes causes the stage to fail.

And there is another error that I'm not sure is related.

java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$
at
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55)
at
org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55)
at
org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58)
at
org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust 
wrote:

> Interesting, I believe we have run that query with version 1.1.0 with
> codegen turned on and not much has changed there.  Is the error
> deterministic?
>
> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen  wrote:
>
>> Hi Michael,
>>
>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
>>
>> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust wrote:
>>
>>> What version of Spark SQL?
>>>
>>> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen  wrote:
>>>
>>>> Hi all,
>>>>
>>>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
>>>> got the exceptions below; has anyone else seen these before?
>>>>
>>>> java.lang.ExceptionInInitializerError
>>>> at
>>>> org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
>>>> at
>>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
>>>> at
>>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>>> at
>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>> at
>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>> Caused by: java.lang.NullPointerException
>>>> at
>>>> scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
>>>> at
>>>> scala.reflect.internal.Types$UniqueType.<init>(Types.scala:1304)
>>>> at scala.r

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
Interesting, I believe we have run that query with version 1.1.0 with
codegen turned on and not much has changed there.  Is the error
deterministic?

On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen  wrote:

> Hi Michael,
>
> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
>
> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust 
> wrote:
>
>> What version of Spark SQL?
>>
>> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen  wrote:
>>
>>> Hi all,
>>>
>>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
>>> got the exceptions below; has anyone else seen these before?
>>>
>>> java.lang.ExceptionInInitializerError
>>> at
>>> org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
>>> at
>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
>>> at
>>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.lang.NullPointerException
>>> at
>>> scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
>>> at
>>> scala.reflect.internal.Types$UniqueType.<init>(Types.scala:1304)
>>> at scala.reflect.internal.Types$TypeRef.<init>(Types.scala:2341)
>>> at
>>> scala.reflect.internal.Types$NoArgsTypeRef.<init>(Types.scala:2137)
>>> at
>>> scala.reflect.internal.Types$TypeRef$$anon$6.<init>(Types.scala:2544)
>>> at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
>>> at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
>>> at
>>> scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
>>> at
>>> scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
>>> at
>>> scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
>>> at
>>> scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
>>> at
>>> scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
>>> at
>>> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>>> at
>>> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>>> at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>>> at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.<init>(CodeGenerator.scala:46)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<init>(GenerateProjection.scala:29)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<clinit>(GenerateProjection.scala)
>>> ... 15 more
>>> --
>>> Best Regards
>>>
>>
>>
>
>
> --
> Best Regards
>


Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Hi Michael,

We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.

On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust 
wrote:

> What version of Spark SQL?
>
> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen  wrote:
>
>> Hi all,
>>
>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
>> got the exceptions below; has anyone else seen these before?
>>
>> java.lang.ExceptionInInitializerError
>> at
>> org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
>> at
>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
>> at
>> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> Caused by: java.lang.NullPointerException
>> at
>> scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
>> at
>> scala.reflect.internal.Types$UniqueType.<init>(Types.scala:1304)
>> at scala.reflect.internal.Types$TypeRef.<init>(Types.scala:2341)
>> at
>> scala.reflect.internal.Types$NoArgsTypeRef.<init>(Types.scala:2137)
>> at
>> scala.reflect.internal.Types$TypeRef$$anon$6.<init>(Types.scala:2544)
>> at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
>> at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
>> at
>> scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
>> at
>> scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
>> at
>> scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
>> at
>> scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
>> at
>> scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
>> at
>> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>> at
>> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>> at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>> at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.<init>(CodeGenerator.scala:46)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<init>(GenerateProjection.scala:29)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<clinit>(GenerateProjection.scala)
>> ... 15 more
>> --
>> Best Regards
>>
>
>


-- 
Best Regards


Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL?

On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen  wrote:

> Hi all,
>
> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
> got the exceptions below; has anyone else seen these before?
>
> java.lang.ExceptionInInitializerError
> at
> org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.NullPointerException
> at
> scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
> at scala.reflect.internal.Types$UniqueType.<init>(Types.scala:1304)
> at scala.reflect.internal.Types$TypeRef.<init>(Types.scala:2341)
> at
> scala.reflect.internal.Types$NoArgsTypeRef.<init>(Types.scala:2137)
> at
> scala.reflect.internal.Types$TypeRef$$anon$6.<init>(Types.scala:2544)
> at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
> at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
> at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
> at
> scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
> at
> scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
> at
> scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
> at
> scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
> at
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
> at
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
> at scala.reflect.api.Universe.typeOf(Universe.scala:59)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.<init>(CodeGenerator.scala:46)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<init>(GenerateProjection.scala:29)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<clinit>(GenerateProjection.scala)
> ... 15 more
> --
> Best Regards
>


Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Sadhan Sood
Hi Cheng,

I tried reading the parquet file (on which we were getting the exception)
through parquet-tools, and it is able to dump the file and I can read the
metadata, etc. I also loaded the file through a hive table and can run a
table scan query on it as well. Let me know if I can do more to help
resolve the problem; I'll run it through a debugger and see if I can get
more information on it in the meantime.

Thanks,
Sadhan

On Sun, Nov 16, 2014 at 4:35 AM, Cheng Lian  wrote:

>  (Forgot to cc user mail list)
>
>
> On 11/16/14 4:59 PM, Cheng Lian wrote:
>
> Hey Sadhan,
>
>  Thanks for the additional information, this is helpful. Seems that some
> Parquet internal contract was broken, but I'm not sure whether it's caused
> by Spark SQL or Parquet, or even maybe the Parquet file itself was damaged
> somehow. I'm investigating this. In the meanwhile, would you mind to help
> to narrow down the problem by trying to scan exactly the same Parquet file
> with some other systems (e.g. Hive or Impala)? If other systems work, then
> there must be something wrong with Spark SQL.
>
>  Cheng
>
> On Sun, Nov 16, 2014 at 1:19 PM, Sadhan Sood 
> wrote:
>
>> Hi Cheng,
>>
>>  Thanks for your response. Here is the stack trace from yarn logs:
>>
>>  Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>> at java.util.ArrayList.elementData(ArrayList.java:418)
>> at java.util.ArrayList.get(ArrayList.java:431)
>> at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
>> at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
>> at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
>> at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
>> at 
>> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:282)
>> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>> at 
>> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>> at 
>> parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>> at 
>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
>> at 
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>> ... 26 more
>>
>>
>> On Sat, Nov 15, 2014 at 9:28 AM, Cheng Lian 
>> wrote:
>>
>>>  Hi Sadhan,
>>>
>>> Could you please provide the stack trace of the
>>> ArrayIndexOutOfBoundsException (if any)? The reason why the first query
>>> succeeds is that Spark SQL doesn’t bother reading all data from the table
>>> to give COUNT(*). In the second case, however, the whole table is asked
>>> to be cached lazily via the cacheTable call, thus it’s scanned to build
>>> the in-memory columnar cache. Then thing went wrong while scanning this LZO
>>> compressed Parquet file. But unfortunately the stack trace at hand doesn’t
>>> indicate the root cause.
>>>
>>> Cheng
>>>
>>> On 11/15/14 5:28 AM, Sadhan Sood wrote:
>>>
>>> While testing SparkSQL on a bunch of parquet files (basically used to be
>>> a partition for one of our hive tables), I encountered this error:
>>>
>>>  import org.apache.spark.sql.SchemaRDD
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.Path;
>>>
>>>  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>
>>>  val parquetFileRDD = sqlContext.parquetFile(parquetFile)
>>> parquetFileRDD.registerTempTable("xyz_20141109")
>>> sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- works
>>> fine
>>> sqlContext.cacheTable("xyz_20141109")
>>> sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- fails
>>> with an exception
>>>
>>>   parquet.io.ParquetDecodingException: Can not read value at 0 in block
>>> -1 in file
>>> hdfs://::9000/event_logs/xyz/20141109/part-9359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-9.lzo.parquet
>>>
>>> at
>>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>>>
>>> at
>>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>>>
>>

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Cheng Lian

(Forgot to cc user mail list)

On 11/16/14 4:59 PM, Cheng Lian wrote:

Hey Sadhan,

Thanks for the additional information, this is helpful. It seems that
some Parquet internal contract was broken, but I'm not sure whether
it's caused by Spark SQL or Parquet, or whether the Parquet file
itself was somehow damaged. I'm investigating this. In the meantime,
would you mind helping to narrow down the problem by trying to scan
exactly the same Parquet file with some other system (e.g. Hive or
Impala)? If other systems work, then there must be something wrong
with Spark SQL.


Cheng

On Sun, Nov 16, 2014 at 1:19 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:


Hi Cheng,

Thanks for your response. Here is the stack trace from yarn logs:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:418)
 at java.util.ArrayList.get(ArrayList.java:431)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
 at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
 at 
parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:282)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
 at 
parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
 at 
parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
 at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
 at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
 ... 26 more


On Sat, Nov 15, 2014 at 9:28 AM, Cheng Lian <lian.cs@gmail.com> wrote:

Hi Sadhan,

Could you please provide the stack trace of the
|ArrayIndexOutOfBoundsException| (if any)? The reason why the
first query succeeds is that Spark SQL doesn’t bother reading
all data from the table to give |COUNT(*)|. In the second
case, however, the whole table is asked to be cached lazily
via the |cacheTable| call, thus it’s scanned to build the
in-memory columnar cache. Then thing went wrong while scanning
this LZO compressed Parquet file. But unfortunately the stack
trace at hand doesn’t indicate the root cause.

Cheng

On 11/15/14 5:28 AM, Sadhan Sood wrote:


While testing SparkSQL on a bunch of parquet files (basically
used to be a partition for one of our hive tables), I
encountered this error:

import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val parquetFileRDD = sqlContext.parquetFile(parquetFile)
parquetFileRDD.registerTempTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM
xyz_20141109").collect() <-- works fine
sqlContext.cacheTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM
xyz_20141109").collect() <-- fails with an exception

parquet.io.ParquetDecodingException: Can not read value at 0
in block -1 in file

hdfs://::9000/event_logs/xyz/20141109/part-9359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-9.lzo.parquet

at

parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)

at

parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)

at
org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:145)

at

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at
scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)

at
scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)

at

org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.hasNext(InMemoryColumnarTableScan.scala:136)

at
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)

at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)

at
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)

at
org.apache.spark

SparkSQL exception on spark.sql.codegen

2014-11-15 Thread Eric Zhen
Hi all,

We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we got
the exceptions below; has anyone else seen these before?
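
For reference, a minimal sketch of how the flag is set (assuming a SQLContext
or HiveContext named sqlContext; the actual Q19 text is omitted); the
exception we hit follows:

// Enable the experimental code generation path before running the query.
sqlContext.setConf("spark.sql.codegen", "true")
val tpcdsQ19 = "..."  // placeholder for the full TPC-DS Q19 SQL
sqlContext.sql(tpcdsQ19).collect()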

java.lang.ExceptionInInitializerError
at
org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
at
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
at
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.NullPointerException
at
scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
at scala.reflect.internal.Types$UniqueType.<init>(Types.scala:1304)
at scala.reflect.internal.Types$TypeRef.<init>(Types.scala:2341)
at
scala.reflect.internal.Types$NoArgsTypeRef.<init>(Types.scala:2137)
at
scala.reflect.internal.Types$TypeRef$$anon$6.<init>(Types.scala:2544)
at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
at
scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
at
scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
at
scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
at
scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
at
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
at scala.reflect.api.Universe.typeOf(Universe.scala:59)
at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.<init>(CodeGenerator.scala:46)
at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<init>(GenerateProjection.scala:29)
at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<clinit>(GenerateProjection.scala)
... 15 more
-- 
Best Regards


Re: SparkSQL exception on cached parquet table

2014-11-15 Thread sadhan
Hi Cheng,

Thanks for your response. Here is the stack trace from yarn logs:





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-exception-on-cached-parquet-table-tp18978p19020.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian

Hi Sadhan,

Could you please provide the stack trace of the 
|ArrayIndexOutOfBoundsException| (if any)? The reason why the first 
query succeeds is that Spark SQL doesn’t bother reading all data from 
the table to give |COUNT(*)|. In the second case, however, the whole 
table is asked to be cached lazily via the |cacheTable| call, thus it’s 
scanned to build the in-memory columnar cache. Then thing went wrong 
while scanning this LZO compressed Parquet file. But unfortunately the 
stack trace at hand doesn’t indicate the root cause.


Cheng

On 11/15/14 5:28 AM, Sadhan Sood wrote:

While testing SparkSQL on a bunch of parquet files (basically used to 
be a partition for one of our hive tables), I encountered this error:


import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val parquetFileRDD = sqlContext.parquetFile(parquetFile)
parquetFileRDD.registerTempTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- 
works fine

sqlContext.cacheTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- 
fails with an exception


parquet.io.ParquetDecodingException: Can not read value at 0 in block 
-1 in file 
hdfs://::9000/event_logs/xyz/20141109/part-9359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-9.lzo.parquet


at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)


at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)


at 
org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:145)


at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)


at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)

at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)

at 
org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.hasNext(InMemoryColumnarTableScan.scala:136)


at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)


at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)


at 
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)


at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)


at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)


at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)


at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)


at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)


at org.apache.spark.scheduler.Task.run(Task.scala:56)

at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)


at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)


at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)


at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.ArrayIndexOutOfBoundsException





Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
Hm… Have you tuned |spark.storage.memoryFraction|? By default, 60% of 
memory is used for caching. You may refer to details from here 
http://spark.apache.org/docs/latest/configuration.html
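
A minimal sketch of raising it when the context is built (the value here is
illustrative; the default is 0.6):

import org.apache.spark.{SparkConf, SparkContext}

// spark.storage.memoryFraction is the share of each executor's heap reserved
// for cached blocks; the rest goes to shuffle and task working memory.
val conf = new SparkConf()
  .setAppName("cache-heavy-job")
  .set("spark.storage.memoryFraction", "0.7")
val sc = new SparkContext(conf)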


On 11/15/14 5:43 AM, Sadhan Sood wrote:

Thanks Cheng, that was helpful. I noticed from UI that only half of 
the memory per executor was being used for caching, is that true? We 
have a 2 TB sequence file dataset that we wanted to cache in our 
cluster with ~ 5TB memory but caching still failed and what looked 
like from the UI was that it used 2.5 TB of memory and almost wrote 12 
TB to disk (at which point it was useless) during the mapPartition 
stage. Also, couldn't run more than 2 executors/box (60g memory/box) 
or else it died very quickly from lesser memory/executor (not sure 
why?) although I/O seemed to be going much faster which makes sense 
because of more parallel reads.


On Thu, Nov 13, 2014 at 10:50 PM, Cheng Lian wrote:


No, the columnar buffer is built in a small batching manner, the
batch size is controlled by the
|spark.sql.inMemoryColumnarStorage.batchSize| property. The
default value for this in master and branch-1.2 is 10,000 rows per
batch.

On 11/14/14 1:27 AM, Sadhan Sood wrote:


Thanks Cheng, just one more question - does that mean that we
still need enough memory in the cluster to uncompress the data
before it can be compressed again or does that just read the raw
data as is?

On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian
<lian.cs@gmail.com> wrote:

Currently there’s no way to cache the compressed sequence
file directly. Spark SQL uses in-memory columnar format while
caching table rows, so we must read all the raw data and
convert them into columnar format. However, you can enable
in-memory columnar compression by setting
|spark.sql.inMemoryColumnarStorage.compressed| to |true|.
This property is already set to true by default in master
branch and branch-1.2.

On 11/13/14 7:16 AM, Sadhan Sood wrote:


We noticed while caching data from our hive tables which
contain data in compressed sequence file format that it gets
uncompressed in memory when getting cached. Is there a way
to turn this off and cache the compressed data as is ?



Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Sadhan Sood
Thanks Cheng, that was helpful. I noticed from the UI that only half of the
memory per executor was being used for caching, is that true? We have a 2
TB sequence file dataset that we wanted to cache in our cluster with ~5 TB of
memory, but caching still failed, and what it looked like from the UI was that
it used 2.5 TB of memory and wrote almost 12 TB to disk (at which point it
was useless) during the mapPartitions stage. Also, we couldn't run more than 2
executors/box (60g memory/box) or else it died very quickly from having less
memory per executor (not sure why?), although I/O seemed to be going much
faster, which makes sense because of more parallel reads.

On Thu, Nov 13, 2014 at 10:50 PM, Cheng Lian  wrote:

>  No, the columnar buffer is built in a small batching manner, the batch
> size is controlled by the spark.sql.inMemoryColumnarStorage.batchSize
> property. The default value for this in master and branch-1.2 is 10,000
> rows per batch.
>
> On 11/14/14 1:27 AM, Sadhan Sood wrote:
>
>   Thanks Cheng, just one more question - does that mean that we still
> need enough memory in the cluster to uncompress the data before it can be
> compressed again or does that just read the raw data as is?
>
> On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian 
> wrote:
>
>>  Currently there’s no way to cache the compressed sequence file
>> directly. Spark SQL uses in-memory columnar format while caching table
>> rows, so we must read all the raw data and convert them into columnar
>> format. However, you can enable in-memory columnar compression by setting
>> spark.sql.inMemoryColumnarStorage.compressed to true. This property is
>> already set to true by default in master branch and branch-1.2.
>>
>> On 11/13/14 7:16 AM, Sadhan Sood wrote:
>>
>> We noticed while caching data from our hive tables which contain data in
>> compressed sequence file format that it gets uncompressed in memory when
>> getting cached. Is there a way to turn this off and cache the compressed
>> data as is ?
>>
>>
>
>


SparkSQL exception on cached parquet table

2014-11-14 Thread Sadhan Sood
While testing SparkSQL on a bunch of parquet files (basically used to be a
partition for one of our hive tables), I encountered this error:

import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val parquetFileRDD = sqlContext.parquetFile(parquetFile)
parquetFileRDD.registerTempTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- works
fine
sqlContext.cacheTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- fails
with an exception

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in
file
hdfs://::9000/event_logs/xyz/20141109/part-9359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-9.lzo.parquet

at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)

at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)

at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)

at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)

at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryColumnarTableScan.scala:136)

at
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)

at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:56)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.ArrayIndexOutOfBoundsException


Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in a small batching manner, the batch 
size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize| 
property. The default value for this in master and branch-1.2 is 10,000 
rows per batch.
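
A minimal sketch of overriding it (assuming a SQLContext named sqlContext):

// Smaller batches lower the peak memory needed while each columnar batch is
// built, at some cost in compression efficiency.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")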


On 11/14/14 1:27 AM, Sadhan Sood wrote:

Thanks Cheng, just one more question - does that mean that we still 
need enough memory in the cluster to uncompress the data before it can 
be compressed again or does that just read the raw data as is?


On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian wrote:


Currently there’s no way to cache the compressed sequence file
directly. Spark SQL uses in-memory columnar format while caching
table rows, so we must read all the raw data and convert them into
columnar format. However, you can enable in-memory columnar
compression by setting
|spark.sql.inMemoryColumnarStorage.compressed| to |true|. This
property is already set to true by default in master branch and
branch-1.2.

On 11/13/14 7:16 AM, Sadhan Sood wrote:


We noticed while caching data from our hive tables which contain
data in compressed sequence file format that it gets uncompressed
in memory when getting cached. Is there a way to turn this off
and cache the compressed data as is ?



Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng, just one more question - does that mean that we still need
enough memory in the cluster to uncompress the data before it can be
compressed again, or does that just read the raw data as is?

On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian  wrote:

>  Currently there’s no way to cache the compressed sequence file directly.
> Spark SQL uses in-memory columnar format while caching table rows, so we
> must read all the raw data and convert them into columnar format. However,
> you can enable in-memory columnar compression by setting
> spark.sql.inMemoryColumnarStorage.compressed to true. This property is
> already set to true by default in master branch and branch-1.2.
>
> On 11/13/14 7:16 AM, Sadhan Sood wrote:
>
>   We noticed while caching data from our hive tables which contain data
> in compressed sequence file format that it gets uncompressed in memory when
> getting cached. Is there a way to turn this off and cache the compressed
> data as is ?
>
>


Re: loading, querying schemaRDD using SparkSQL

2014-11-13 Thread vdiwakar.malladi
Thanks Michael.

I used Parquet files, and that solved my initial problem to some
extent (i.e. loading data from one context and reading it from another
context).

But I see another issue there. Every time I create the JavaSQLContext, I need
to load the parquet file with the parquetFile method (to create the
JavaSchemaRDD) and register it as a temp table with registerTempTable in
order to query it. This seems to be a problem when our web application runs
in cluster mode, because I need to load the parquet files on each node. Could
you please advise me on this?

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18841.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. 
Spark SQL uses in-memory columnar format while caching table rows, so we 
must read all the raw data and convert them into columnar format. 
However, you can enable in-memory columnar compression by setting 
|spark.sql.inMemoryColumnarStorage.compressed| to |true|. This property 
is already set to true by default in master branch and branch-1.2.
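
A minimal sketch, assuming a SQLContext named sqlContext and a table that has
already been registered (the table name is illustrative):

// Enable columnar compression (already the default on master and branch-1.2),
// then cache the table; the compressed columnar cache is built on first scan.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.cacheTable("my_table")
sqlContext.sql("SELECT COUNT(*) FROM my_table").collect()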


On 11/13/14 7:16 AM, Sadhan Sood wrote:

We noticed while caching data from our hive tables which contain data 
in compressed sequence file format that it gets uncompressed in memory 
when getting cached. Is there a way to turn this off and cache the 
compressed data as is ?




Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed while caching data from our hive tables which contain data in
compressed sequence file format that it gets uncompressed in memory when
getting cached. Is there a way to turn this off and cache the compressed
data as is ?


Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
On re-running the cache statement, I see from the logs that whenever the
collect (stage 1) fails, it always causes mapPartitions (stage 0) to be
re-run for one partition. This can be seen in the collect log as well as in
the container log:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
output location for shuffle 0

The data is an LZO-compressed sequence file with a compressed size of ~26G.
Is there a way to understand why the shuffle keeps failing for one partition?
I believe we have enough memory to store the uncompressed data in memory.
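
For reference, one configuration knob that is sometimes involved when YARN kills executors mid-shuffle like this (only an assumption about the cause, not something confirmed in this thread) is the executor memory overhead; a sketch of raising it when building the context:

import org.apache.spark.{SparkConf, SparkContext}

// Executors killed by YARN for exceeding their container limit show up as
// "remote Akka client disassociated" followed by missing shuffle output.
// Raising the off-heap overhead leaves more head-room for shuffle buffers.
val conf = new SparkConf()
  .setAppName("CacheTablePartition")
  .set("spark.yarn.executor.memoryOverhead", "1024") // in MB; the value is illustrative

val sc = new SparkContext(conf)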

On Wed, Nov 12, 2014 at 2:50 PM, Sadhan Sood  wrote:

> This is the log output:
>
> 2014-11-12 19:07:16,561 INFO  thriftserver.SparkExecuteStatementOperation
> (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS
> SELECT * FROM xyz where date_prefix = 20141112'
>
> 2014-11-12 19:07:17,455 INFO  Configuration.deprecation
> (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.map.tasks is
> deprecated. Instead, use mapreduce.job.maps
>
> 2014-11-12 19:07:17,756 INFO  spark.SparkContext
> (Logging.scala:logInfo(59)) - Created broadcast 0 from broadcast at
> TableReader.scala:68
>
> 2014-11-12 19:07:18,292 INFO  spark.SparkContext
> (Logging.scala:logInfo(59)) - Starting job: collect at SparkPlan.scala:84
>
> 2014-11-12 19:07:22,801 INFO  mapred.FileInputFormat
> (FileInputFormat.java:listStatus(253)) - Total input paths to process : 200
>
> 2014-11-12 19:07:22,835 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Registering RDD 12 (mapPartitions at
> Exchange.scala:86)
>
> 2014-11-12 19:07:22,837 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Got job 0 (collect at SparkPlan.scala:84)
> with 1 output partitions (allowLocal=false)
>
> 2014-11-12 19:07:22,838 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Final stage: Stage 1(collect at
> SparkPlan.scala:84)
>
> 2014-11-12 19:07:22,838 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Parents of final stage: List(Stage 0)
>
> 2014-11-12 19:07:22,842 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Missing parents: List(Stage 0)
>
> 2014-11-12 19:07:22,871 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Submitting Stage 0 (MapPartitionsRDD[12] at
> mapPartitions at Exchange.scala:86), which has no missing parents
>
> 2014-11-12 19:07:22,916 INFO  spark.SparkContext
> (Logging.scala:logInfo(59)) - Created broadcast 1 from broadcast at
> DAGScheduler.scala:838
>
> 2014-11-12 19:07:22,963 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Submitting 461 missing tasks from Stage 0
> (MapPartitionsRDD[12] at mapPartitions at Exchange.scala:86)
>
> 2014-11-12 19:10:04,088 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Stage 0 (mapPartitions at Exchange.scala:86)
> finished in 161.113 s
>
> 2014-11-12 19:10:04,089 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - looking for newly runnable stages
>
> 2014-11-12 19:10:04,089 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - running: Set()
>
> 2014-11-12 19:10:04,090 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - waiting: Set(Stage 1)
>
> 2014-11-12 19:10:04,090 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - failed: Set()
>
> 2014-11-12 19:10:04,094 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Missing parents for Stage 1: List()
>
> 2014-11-12 19:10:04,097 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Submitting Stage 1 (MappedRDD[16] at map at
> SparkPlan.scala:84), which is now runnable
>
> 2014-11-12 19:10:04,112 INFO  spark.SparkContext
> (Logging.scala:logInfo(59)) - Created broadcast 2 from broadcast at
> DAGScheduler.scala:838
>
> 2014-11-12 19:10:04,115 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Submitting 1 missing tasks from Stage 1
> (MappedRDD[16] at map at SparkPlan.scala:84)
>
> 2014-11-12 19:10:08,541 ERROR cluster.YarnClientClusterScheduler
> (Logging.scala:logError(75)) - Lost executor 52 on
> ip-10-61-175-167.ec2.internal: remote Akka client disassociated
>
> 2014-11-12 19:10:08,543 WARN  remote.ReliableDeliverySupervisor
> (Slf4jLogger.scala:apply$mcV$sp(71)) - Association with remote system
> [akka.tcp://sparkExecutor@ip-10-61-175-167.ec2.internal:50918] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>
> 2014-11-12 19:10:08,548 ERROR cluster.YarnClientSchedulerBackend
> (Logging.scala:logError(75)) - Asked to remove non-existent executor 52
>
> 2014-11-12 19:10:08,550 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Executor lost: 52 (epoch 1)
>
> 2014-11-12 19:10:08,555 INFO  scheduler.Stage (Logging.scala:logInfo(59))
> - Stage 0 is now unavailable on executor 52 (460/461, false)
>
> 2014-11-12 19:10:08,686 INFO  scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Marking Stage 1 (collect at
> SparkPlan.scala:84) as failed due to a fetch failure from Stage 0
> (mapPartitions at Exchange.scala:86)
>
> 201

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
This is the log output:

2014-11-12 19:07:16,561 INFO  thriftserver.SparkExecuteStatementOperation
(Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS
SELECT * FROM xyz where date_prefix = 20141112'

2014-11-12 19:07:17,455 INFO  Configuration.deprecation
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps

2014-11-12 19:07:17,756 INFO  spark.SparkContext
(Logging.scala:logInfo(59)) - Created broadcast 0 from broadcast at
TableReader.scala:68

2014-11-12 19:07:18,292 INFO  spark.SparkContext
(Logging.scala:logInfo(59)) - Starting job: collect at SparkPlan.scala:84

2014-11-12 19:07:22,801 INFO  mapred.FileInputFormat
(FileInputFormat.java:listStatus(253)) - Total input paths to process : 200

2014-11-12 19:07:22,835 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Registering RDD 12 (mapPartitions at
Exchange.scala:86)

2014-11-12 19:07:22,837 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Got job 0 (collect at SparkPlan.scala:84)
with 1 output partitions (allowLocal=false)

2014-11-12 19:07:22,838 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Final stage: Stage 1(collect at
SparkPlan.scala:84)

2014-11-12 19:07:22,838 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Parents of final stage: List(Stage 0)

2014-11-12 19:07:22,842 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Missing parents: List(Stage 0)

2014-11-12 19:07:22,871 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Submitting Stage 0 (MapPartitionsRDD[12] at
mapPartitions at Exchange.scala:86), which has no missing parents

2014-11-12 19:07:22,916 INFO  spark.SparkContext
(Logging.scala:logInfo(59)) - Created broadcast 1 from broadcast at
DAGScheduler.scala:838

2014-11-12 19:07:22,963 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Submitting 461 missing tasks from Stage 0
(MapPartitionsRDD[12] at mapPartitions at Exchange.scala:86)

2014-11-12 19:10:04,088 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Stage 0 (mapPartitions at Exchange.scala:86)
finished in 161.113 s

2014-11-12 19:10:04,089 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - looking for newly runnable stages

2014-11-12 19:10:04,089 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - running: Set()

2014-11-12 19:10:04,090 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - waiting: Set(Stage 1)

2014-11-12 19:10:04,090 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - failed: Set()

2014-11-12 19:10:04,094 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Missing parents for Stage 1: List()

2014-11-12 19:10:04,097 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Submitting Stage 1 (MappedRDD[16] at map at
SparkPlan.scala:84), which is now runnable

2014-11-12 19:10:04,112 INFO  spark.SparkContext
(Logging.scala:logInfo(59)) - Created broadcast 2 from broadcast at
DAGScheduler.scala:838

2014-11-12 19:10:04,115 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Submitting 1 missing tasks from Stage 1
(MappedRDD[16] at map at SparkPlan.scala:84)

2014-11-12 19:10:08,541 ERROR cluster.YarnClientClusterScheduler
(Logging.scala:logError(75)) - Lost executor 52 on
ip-10-61-175-167.ec2.internal: remote Akka client disassociated

2014-11-12 19:10:08,543 WARN  remote.ReliableDeliverySupervisor
(Slf4jLogger.scala:apply$mcV$sp(71)) - Association with remote system
[akka.tcp://sparkExecutor@ip-10-61-175-167.ec2.internal:50918] has failed,
address is now gated for [5000] ms. Reason is: [Disassociated].

2014-11-12 19:10:08,548 ERROR cluster.YarnClientSchedulerBackend
(Logging.scala:logError(75)) - Asked to remove non-existent executor 52

2014-11-12 19:10:08,550 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Executor lost: 52 (epoch 1)

2014-11-12 19:10:08,555 INFO  scheduler.Stage (Logging.scala:logInfo(59)) -
Stage 0 is now unavailable on executor 52 (460/461, false)

2014-11-12 19:10:08,686 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Marking Stage 1 (collect at
SparkPlan.scala:84) as failed due to a fetch failure from Stage 0
(mapPartitions at Exchange.scala:86)

2014-11-12 19:10:08,686 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Stage 1 (collect at SparkPlan.scala:84)
failed in 4.571 s

2014-11-12 19:10:08,687 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Resubmitting Stage 0 (mapPartitions at
Exchange.scala:86) and Stage 1 (collect at SparkPlan.scala:84) due to fetch
failure

2014-11-12 19:10:08,908 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Resubmitting failed stages

2014-11-12 19:10:08,974 INFO  scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Submitting Stage 0 (MapPartitionsRDD[12] at
mapPartitions at Exchange.scala:86), which has no missing parents

2014-11-12 19:10:08,989 INFO  spark.SparkContext
(Logging.scala:logInfo(59)) - Created broadcast 3 from broadcast at
DAGSchedule

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running Spark on YARN with combined memory > 1TB, and when trying to
cache a table partition (which is < 100G) we are seeing a lot of failed
collect stages in the UI, and the caching never succeeds. Because of the
failed collect, it seems like the mapPartitions stage keeps getting
resubmitted. We have more than enough memory, so it's surprising that we are
seeing this issue. Can someone please help? Thanks!

The stack trace of the failed collect from UI is:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
output location for shuffle 0
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at 
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Re: loading, querying schemaRDD using SparkSQL

2014-11-06 Thread Michael Armbrust
It can, but currently that method uses the default Hive SerDe, which is not
very robust (it does not deal well with \n in strings) and probably is not
super fast.  You'll also need to be using a HiveContext for it to work.
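
A rough sketch of that route (assuming Spark 1.1 with a shared Hive metastore; the path and table name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("SaveAsTableExample"))
val hiveContext = new HiveContext(sc)

// saveAsTable writes through the Hive metastore, so the result is a real
// (non-temporary) table that another context pointing at the same metastore
// can query later; the rows go through the default Hive SerDe.
val people = hiveContext.jsonFile("hdfs:///data/people.json")
people.saveAsTable("people_persistent")

// Later, from a different application / HiveContext:
// hiveContext.sql("SELECT count(*) FROM people_persistent").collect()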

On Tue, Nov 4, 2014 at 8:20 PM, vdiwakar.malladi  wrote:

> Thanks Michael for your response.
>
> Just now, i saw saveAsTable method on JavaSchemaRDD object (in Spark 1.1.0
> API). But I couldn't find the corresponding documentation. Will that help?
> Please let me know.
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
Thanks Michael for your response.

Just now, I saw the saveAsTable method on the JavaSchemaRDD object (in the
Spark 1.1.0 API), but I couldn't find the corresponding documentation. Will
that help? Please let me know.

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
Done.

https://issues.apache.org/jira/browse/SPARK-4226

Hoping this will make it into 1.3?  :)

-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Tuesday, November 4, 2014 at 11:31 AM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - No support for subqueries in 1.2-snapshot?

This is not supported yet.  It would be great if you could open a JIRA (though 
I think apache JIRA is down ATM).

On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu <terry@smartfocus.com> wrote:
I’m trying to execute a subquery inside an IN clause and am encountering an 
unsupported language feature in the parser.


java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_SUBQUERY_EXPR

TOK_SUBQUERY_OP

  in

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_FUNCTION

in

TOK_TABLE_OR_COL

  customerid

2

3

TOK_TABLE_OR_COL

  customerid


scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :

TOK_SUBQUERY_EXPR

  TOK_SUBQUERY_OP

in

  TOK_QUERY

TOK_FROM

  TOK_TABREF

TOK_TABNAME

  sparkbug

TOK_INSERT

  TOK_DESTINATION

TOK_DIR

  TOK_TMP_FILE

  TOK_SELECT

TOK_SELEXPR

  TOK_TABLE_OR_COL

customerid

  TOK_WHERE

TOK_FUNCTION

  in

  TOK_TABLE_OR_COL

customerid

  2

  3

  TOK_TABLE_OR_COL

customerid

" +



org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)



at scala.sys.package$.error(package.scala:27)

at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)

at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)

Are subqueries in predicates just not supported in 1.2? I think I’m seeing the 
same issue as:

http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html

Thanks,
-Terry






Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
Temporary tables are local to the context that creates them (just like
RDDs).  I'd recommend saving the data out as Parquet to share it between
contexts.
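
A minimal sketch of that approach, with each application running its own context (the paths and names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Application A: build the SchemaRDD once and persist it as Parquet on shared storage.
val scA = new SparkContext(new SparkConf().setAppName("ParquetWriter"))
val sqlA = new SQLContext(scA)
sqlA.jsonFile("hdfs:///data/input.json").saveAsParquetFile("hdfs:///data/shared.parquet")

// Application B (a separate JVM with its own SparkContext, e.g. the web app):
// read the Parquet file back and register it before querying.
val scB = new SparkContext(new SparkConf().setAppName("ParquetReader"))
val sqlB = new SQLContext(scB)
sqlB.parquetFile("hdfs:///data/shared.parquet").registerTempTable("shared")
sqlB.sql("SELECT count(*) FROM shared").collect()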

On Tue, Nov 4, 2014 at 3:18 AM, vdiwakar.malladi  wrote:

> Hi,
>
> There is a need in my application to query the loaded data into
> sparkcontext
> (I mean loaded SchemaRDD from JSON file(s)). For this purpose, I created
> the
> SchemaRDD and call registerTempTable method in a standalone program and
> submited the application using spark-submit command.
>
> Then I have a web application from which I used Spark SQL to query the
> same.
> But I'm getting 'Table Not Found' error.
>
> So, my question is, is it possible to load the data from one context and
> query from other? If not, can any one advice me on this.
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet.  It would be great if you could open a JIRA
(though I think apache JIRA is down ATM).
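
As an aside, one workaround for this particular shape of query, until subqueries in predicates are supported, is to rewrite the IN subquery as a LEFT SEMI JOIN (a sketch only, not validated against the reporter's data):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("SemiJoinWorkaround"))
val hiveContext = new HiveContext(sc)

// Equivalent to:
//   SELECT customerid FROM sparkbug
//   WHERE customerid IN (SELECT customerid FROM sparkbug WHERE customerid IN (2, 3))
// but expressed as a semi join, which the HiveQL parser accepts.
val result = hiveContext.sql("""
  SELECT a.customerid
  FROM sparkbug a
  LEFT SEMI JOIN (
    SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)
  ) b ON a.customerid = b.customerid
""").collect()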

On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu  wrote:

>  I’m trying to execute a subquery inside an IN clause and am encountering
> an unsupported language feature in the parser.
>
>  java.lang.RuntimeException: Unsupported language features in query:
> select customerid from sparkbug where customerid in (select customerid from
> sparkbug where customerid in (2,3))
>
> TOK_QUERY
>
>   TOK_FROM
>
> TOK_TABREF
>
>   TOK_TABNAME
>
> sparkbug
>
>   TOK_INSERT
>
> TOK_DESTINATION
>
>   TOK_DIR
>
> TOK_TMP_FILE
>
> TOK_SELECT
>
>   TOK_SELEXPR
>
> TOK_TABLE_OR_COL
>
>   customerid
>
> TOK_WHERE
>
>   TOK_SUBQUERY_EXPR
>
> TOK_SUBQUERY_OP
>
>   in
>
> TOK_QUERY
>
>   TOK_FROM
>
> TOK_TABREF
>
>   TOK_TABNAME
>
> sparkbug
>
>   TOK_INSERT
>
> TOK_DESTINATION
>
>   TOK_DIR
>
> TOK_TMP_FILE
>
> TOK_SELECT
>
>   TOK_SELEXPR
>
> TOK_TABLE_OR_COL
>
>   customerid
>
> TOK_WHERE
>
>   TOK_FUNCTION
>
> in
>
> TOK_TABLE_OR_COL
>
>   customerid
>
> 2
>
> 3
>
> TOK_TABLE_OR_COL
>
>   customerid
>
>
>  scala.NotImplementedError: No parse rules for ASTNode type: 817, text:
> TOK_SUBQUERY_EXPR :
>
> TOK_SUBQUERY_EXPR
>
>   TOK_SUBQUERY_OP
>
> in
>
>   TOK_QUERY
>
> TOK_FROM
>
>   TOK_TABREF
>
> TOK_TABNAME
>
>   sparkbug
>
> TOK_INSERT
>
>   TOK_DESTINATION
>
> TOK_DIR
>
>   TOK_TMP_FILE
>
>   TOK_SELECT
>
> TOK_SELEXPR
>
>   TOK_TABLE_OR_COL
>
> customerid
>
>   TOK_WHERE
>
> TOK_FUNCTION
>
>   in
>
>   TOK_TABLE_OR_COL
>
> customerid
>
>   2
>
>   3
>
>   TOK_TABLE_OR_COL
>
> customerid
>
> " +
>
>
>
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
>
>
>
> at scala.sys.package$.error(package.scala:27)
>
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
>
> at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
>
> at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
>
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>
>  Are subqueries in predicates just not supported in 1.2? I think I’m
> seeing the same issue as:
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html
>
>  Thanks,
> -Terry
>
>


SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
I’m trying to execute a subquery inside an IN clause and am encountering an 
unsupported language feature in the parser.


java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_SUBQUERY_EXPR

TOK_SUBQUERY_OP

  in

TOK_QUERY

  TOK_FROM

TOK_TABREF

  TOK_TABNAME

sparkbug

  TOK_INSERT

TOK_DESTINATION

  TOK_DIR

TOK_TMP_FILE

TOK_SELECT

  TOK_SELEXPR

TOK_TABLE_OR_COL

  customerid

TOK_WHERE

  TOK_FUNCTION

in

TOK_TABLE_OR_COL

  customerid

2

3

TOK_TABLE_OR_COL

  customerid


scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :

TOK_SUBQUERY_EXPR

  TOK_SUBQUERY_OP

in

  TOK_QUERY

TOK_FROM

  TOK_TABREF

TOK_TABNAME

  sparkbug

TOK_INSERT

  TOK_DESTINATION

TOK_DIR

  TOK_TMP_FILE

  TOK_SELECT

TOK_SELEXPR

  TOK_TABLE_OR_COL

customerid

  TOK_WHERE

TOK_FUNCTION

  in

  TOK_TABLE_OR_COL

customerid

  2

  3

  TOK_TABLE_OR_COL

customerid

" +



org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)



at scala.sys.package$.error(package.scala:27)

at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)

at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)

at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)

Are subqueries in predicates just not supported in 1.2? I think I’m seeing the 
same issue as:

http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html

Thanks,
-Terry





loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
Hi,

There is a need in my application to query data loaded into a SparkContext
(I mean a SchemaRDD loaded from JSON file(s)). For this purpose, I created the
SchemaRDD and called the registerTempTable method in a standalone program, and
submitted the application using the spark-submit command.

Then I have a web application from which I use Spark SQL to query the same
data, but I'm getting a 'Table Not Found' error.

So my question is: is it possible to load the data from one context and
query it from another? If not, can anyone advise me on this?

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. 
But it’s also less flexible: it couldn’t handle irregular schemas, didn’t support 
JSON, and so on.

On 01.11.2014, at 02:20, Soumya Simanta  wrote:

> I agree. My personal experience with Spark core is that it performs really 
> well once you tune it properly. 
> 
> As far I understand SparkSQL under the hood performs many of these 
> optimizations (order of Spark operations) and uses a more efficient storage 
> format. Is this assumption correct? 
> 
> Has anyone done any comparison of SparkSQL with Impala ? The fact that many 
> of the queries don't even finish in the benchmark is quite surprising and 
> hard to believe. 
> 
> A few months ago there were a few emails about Spark not being able to handle 
> large volumes (TBs) of data. That myth was busted recently when the folks at 
> Databricks published their sorting record results. 
>  
> 
> Thanks
> -Soumya
> 
> 
> 
> 
>  
> 
> On Fri, Oct 31, 2014 at 7:35 PM, Du Li  wrote:
> We have seen all kinds of results published that often contradict each other. 
> My take is that the authors often know more tricks about how to tune their 
> own/familiar products than the others. So the product on focus is tuned for 
> ideal performance while the competitors are not. The authors are not 
> necessarily biased but as a consequence the results are.
> 
> Ideally it’s critical for the user community to be informed of all the 
> in-depth tuning tricks of all products. However, realistically, there is a 
> big gap in terms of documentation. Hope the Spark folks will make a 
> difference. :-)
> 
> Du
> 
> 
> From: Soumya Simanta 
> Date: Friday, October 31, 2014 at 4:04 PM
> To: "user@spark.apache.org" 
> Subject: SparkSQL performance
> 
> I was really surprised to see the results here, esp. SparkSQL "not completing"
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> 
> I was under the impression that SparkSQL performs really well because it can 
> optimize the RDD operations and load only the columns that are required. This 
> essentially means in most cases SparkSQL should be as fast as Spark is. 
> 
> I would be very interested to hear what others in the group have to say about 
> this. 
> 
> Thanks
> -Soumya
> 
> 
> 



Re: Does SparkSQL work with custom defined SerDe?

2014-11-02 Thread Chirag Aggarwal
Did https://issues.apache.org/jira/browse/SPARK-3807 fix the issue you were seeing?
If yes, then please note that it will be part of 1.1.1 and 1.2.

Chirag

From: Chen Song <chen.song...@gmail.com>
Date: Wednesday, 15 October 2014 4:03 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Does SparkSQL work with custom defined SerDe?

Looks like it may be related to 
https://issues.apache.org/jira/browse/SPARK-3807.

I will build from branch 1.1 to see if the issue is resolved.

Chen

On Tue, Oct 14, 2014 at 10:33 AM, Chen Song <chen.song...@gmail.com> wrote:
Sorry for bringing this out again, as I have no clue what could have caused 
this.

I turned on DEBUG logging and did see the jar containing the SerDe class was 
scanned.

More interestingly, I saw the same exception 
(org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes) when running simple select on valid column names and malformed 
column names. This lead me to suspect that syntactical breaks somewhere.

select [valid_column] from table limit 5;
select [malformed_typo_column] from table limit 5;


On Mon, Oct 13, 2014 at 6:04 PM, Chen Song <chen.song...@gmail.com> wrote:
In Hive, the table was created with custom SerDe, in the following way.

row format serde "abc.ProtobufSerDe"

with serdeproperties ("serialization.class"="abc.protobuf.generated.LogA$log_a")

When I start spark-sql shell, I always got the following exception, even for a 
simple query.

select user from log_a limit 25;

I can desc the table without any problem. When I explain the query, I got the 
same exception.


14/10/13 22:01:13 INFO impl.AMRMClientImpl: Waiting for application to be 
successfully unregistered.

Exception in thread "Driver" java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Unresolved attributes: 'user, tree:

Project ['user]

 Filter (dh#4 = 2014-10-13 05)

  LowerCaseSchema

   MetastoreRelation test, log_a, None


at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

at 
scala.collection.AbstractIterator.to(Iterator.scala:1157)

at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.s

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059

On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud 
wrote:

> Great! Thanks.
>
> Sent from my iPad
>
> On Nov 1, 2014, at 8:35 AM, Cheng Lian  wrote:
>
> Hi Jean,
>
> Thanks for reporting this. This is indeed a bug: some column types
> (Binary, Array, Map and Struct, and unfortunately for some reason,
> Boolean), a NoopColumnStats is used to collect column statistics, which
> causes this issue. Filed SPARK-4182 to track this issue, will fix this ASAP.
>
> Cheng
>
> On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud 
> wrote:
>
>> Hi,
>>
>> While testing SparkSQL on top of our Hive metastore, I am getting
>> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
>> table.
>>
>> Basically, I have a table "mtable" partitioned by some "date" field in
>> hive and below is the scala code I am running in spark-shell:
>>
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
>> val rdd_mtable = sqlContext.sql("select * from mtable where
>> date=20141028");
>> rdd_mtable.registerTempTable("rdd_mtable");
>> sqlContext.cacheTable("rdd_mtable");
>> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- OK
>> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- Exception
>>
>> So the first collect() is working just fine, however running the second
>> collect() which I expect use the cached RDD throws some
>> java.lang.ArrayIndexOutOfBoundsException, see the backtrace at the end of
>> this email. It seems the columnar traversal is crashing for some reasons.
>> FYI, I am using spark ToT (234de9232bcfa212317a8073c4a82c3863b36b14).
>>
>> java.lang.ArrayIndexOutOfBoundsException: 14
>> at
>> org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>> at
>> org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>> at
>> org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
>> at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:89)
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66)
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87)
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73)
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147)
>> at
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
>> at
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
>> at scala.Option.map(Option.scala:145)
>> at
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:122)
>> at
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:119)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> at scala.collection.TraversableOnce$class.to(Traver

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Jean-Pascal Billaud
Great! Thanks.

Sent from my iPad

> On Nov 1, 2014, at 8:35 AM, Cheng Lian  wrote:
> 
> Hi Jean,
> 
> Thanks for reporting this. This is indeed a bug: some column types (Binary, 
> Array, Map and Struct, and unfortunately for some reason, Boolean), a 
> NoopColumnStats is used to collect column statistics, which causes this 
> issue. Filed SPARK-4182 to track this issue, will fix this ASAP.
> 
> Cheng
> 
>> On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud  
>> wrote:
>> Hi,
>> 
>> While testing SparkSQL on top of our Hive metastore, I am getting some 
>> java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table.
>> 
>> Basically, I have a table "mtable" partitioned by some "date" field in hive 
>> and below is the scala code I am running in spark-shell:
>> 
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
>> val rdd_mtable = sqlContext.sql("select * from mtable where date=20141028");
>> rdd_mtable.registerTempTable("rdd_mtable");
>> sqlContext.cacheTable("rdd_mtable");
>> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- OK
>> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- Exception
>> 
>> So the first collect() is working just fine, however running the second 
>> collect() which I expect use the cached RDD throws some 
>> java.lang.ArrayIndexOutOfBoundsException, see the backtrace at the end of 
>> this email. It seems the columnar traversal is crashing for some reasons. 
>> FYI, I am using spark ToT (234de9232bcfa212317a8073c4a82c3863b36b14).
>> 
>> java.lang.ArrayIndexOutOfBoundsException: 14
>>  at 
>> org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:89)
>>  at 
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
>>  at 
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
>>  at 
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>  at 
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>  at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>>  at 
>> org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66)
>>  at 
>> org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87)
>>  at 
>> org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73)
>>  at 
>> org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147)
>>  at 
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
>>  at 
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
>>  at scala.Option.map(Option.scala:145)
>>  at 
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:122)
>>  at 
>> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:119)
>>  at 
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>>  at 
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
>>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>  at 
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>  at 
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>  at 
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>  at scala.collect

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Hi Jean,

Thanks for reporting this. This is indeed a bug: for some column types (Binary,
Array, Map and Struct, and unfortunately, for some reason, Boolean), a
NoopColumnStats is used to collect column statistics, which causes this
issue. Filed SPARK-4182 to track this issue; will fix this ASAP.

Cheng

On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud 
wrote:

> Hi,
>
> While testing SparkSQL on top of our Hive metastore, I am getting
> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
> table.
>
> Basically, I have a table "mtable" partitioned by some "date" field in
> hive and below is the scala code I am running in spark-shell:
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
> val rdd_mtable = sqlContext.sql("select * from mtable where
> date=20141028");
> rdd_mtable.registerTempTable("rdd_mtable");
> sqlContext.cacheTable("rdd_mtable");
> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- OK
> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- Exception
>
> So the first collect() is working just fine, however running the second
> collect() which I expect use the cached RDD throws some
> java.lang.ArrayIndexOutOfBoundsException, see the backtrace at the end of
> this email. It seems the columnar traversal is crashing for some reasons.
> FYI, I am using spark ToT (234de9232bcfa212317a8073c4a82c3863b36b14).
>
> java.lang.ArrayIndexOutOfBoundsException: 14
> at
> org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
> at
> org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
> at
> org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:89)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:122)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:119)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
> at
> org

Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.

As far as I understand, SparkSQL under the hood performs many of these
optimizations (ordering of Spark operations) and uses a more efficient storage
format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala? The fact that many
of the queries don't even finish in the benchmark is quite surprising and
hard to believe.

A few months ago there were a few emails about Spark not being able to
handle large volumes (TBs) of data. That myth was busted recently when the
folks at Databricks published their sorting record results.


Thanks
-Soumya






On Fri, Oct 31, 2014 at 7:35 PM, Du Li  wrote:

>   We have seen all kinds of results published that often contradict each
> other. My take is that the authors often know more tricks about how to tune
> their own/familiar products than the others. So the product on focus is
> tuned for ideal performance while the competitors are not. The authors are
> not necessarily biased but as a consequence the results are.
>
>  Ideally it’s critical for the user community to be informed of all the
> in-depth tuning tricks of all products. However, realistically, there is a
> big gap in terms of documentation. Hope the Spark folks will make a
> difference. :-)
>
>  Du
>
>
>   From: Soumya Simanta 
> Date: Friday, October 31, 2014 at 4:04 PM
> To: "user@spark.apache.org" 
> Subject: SparkSQL performance
>
>   I was really surprised to see the results here, esp. SparkSQL "not
> completing"
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
>  I was under the impression that SparkSQL performs really well because it
> can optimize the RDD operations and load only the columns that are
> required. This essentially means in most cases SparkSQL should be as fast
> as Spark is.
>
>  I would be very interested to hear what others in the group have to say
> about this.
>
>  Thanks
> -Soumya
>
>
>


Re: SparkSQL performance

2014-10-31 Thread Du Li
We have seen all kinds of results published that often contradict each other. 
My take is that the authors often know more tricks about how to tune their 
own/familiar products than the others. So the product in focus is tuned for 
ideal performance while the competitors are not. The authors are not 
necessarily biased, but as a consequence the results are.

Ideally it’s critical for the user community to be informed of all the in-depth 
tuning tricks of all products. However, realistically, there is a big gap in 
terms of documentation. Hope the Spark folks will make a difference. :-)

Du


From: Soumya Simanta <soumya.sima...@gmail.com>
Date: Friday, October 31, 2014 at 4:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: SparkSQL performance

I was really surprised to see the results here, esp. SparkSQL "not completing"
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it can 
optimize the RDD operations and load only the columns that are required. This 
essentially means in most cases SparkSQL should be as fast as Spark is.

I would be very interested to hear what others in the group have to say about 
this.

Thanks
-Soumya




SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL "not
completing"
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required. This essentially means in most cases SparkSQL should be as fast
as Spark is.

I would be very interested to hear what others in the group have to say
about this.

Thanks
-Soumya


Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
OK, I created an issue. Hopefully it will be resolved soon.

Again thanks,

best,
/Shahab

On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson  wrote:

> Hi Shahab,
> The apache cassandra version looks great.
>
> I think that doing
> cc.setKeyspace("mydb")
> cc.sql("SELECT * FROM mytable")
>
> versus
> cc.setKeyspace("mydb")
> cc.sql("select * from mydb.mytable ")
>
> Is the problem? And if not, would you mind creating a ticket off-list for
> us to help further? You can create one here:
> https://github.com/datastax/spark-cassandra-connector/issues
> with tag: help wanted :)
>
> Cheers,
>
> - Helena
> @helenaedelson
>
> On Oct 31, 2014, at 1:59 PM, shahab  wrote:
>
> Thanks Helena.
> I tried setting the "KeySpace", but I got same result. I also removed
> other Cassandra dependencies,  but still same exception!
> I also tried to see if this setting appears in the CassandraSQLContext or
> not, so I printed out the output of configustion
>
> val cc = new CassandraSQLContext(sc)
>
> cc.setKeyspace("mydb")
>
> cc.conf.getAll.foreach(f => println (f._1  + " : " +  f._2))
>
> printout:
>
> spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2
>
> spark.driver.host :192.168.1.111
>
> spark.cassandra.connection.host : localhost
>
> spark.cassandra.input.split.size : 1
>
> spark.app.name : SomethingElse
>
> spark.fileserver.uri :  http://192.168.1.111:51463
>
> spark.driver.port : 51461
>
> spark.master :  local
>
> Does it have anything to do with the version of Apache Cassandra that I
> use?? I use "apache-cassandra-2.1.0"
>
>
> best,
> /Shahab
>
> The shortened SBT :
>
> "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
> withSources() withJavadoc(),
>
> "net.jpountz.lz4" % "lz4" % "1.2.0",
>
> "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
> exclude("org.apache.hadoop", "hadoop-core"),
>
> "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
>
> "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
>
> "com.github.nscala-time" %% "nscala-time" % "1.0.0",
>
> "org.scalatest" %% "scalatest" % "1.9.1" % "test",
>
> "org.apache.spark" %% "spark-sql" % "1.1.0" %  "provided",
>
> "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
>
> "org.json4s" %% "json4s-jackson" % "3.2.5",
>
> "junit" % "junit" % "4.8.1" % "test",
>
> "org.slf4j" % "slf4j-api" % "1.7.7",
>
> "org.slf4j" % "slf4j-simple" % "1.7.7",
>
> "org.clapper" %% "grizzled-slf4j" % "1.0.2",
>
> "log4j" % "log4j" % "1.2.17"
>
> On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson <
> helena.edel...@datastax.com> wrote:
>
>> Hi Shahab,
>>
>> I’m just curious, are you explicitly needing to use thrift? Just using
>> the connector with spark does not require any thrift dependencies.
>> Simply: "com.datastax.spark" %% "spark-cassandra-connector" %
>> "1.1.0-beta1”
>>
>> But to your question, you declare the keyspace but also unnecessarily
>> repeat the keyspace.table in your select.
>> Try this instead:
>>
>> val cc = new CassandraSQLContext(sc)
>> cc.setKeyspace(“keyspaceName")
>> val result = cc.sql("SELECT * FROM tableName”) etc
>>
>> - Helena
>> @helenaedelson
>>
>> On Oct 31, 2014, at 1:25 PM, shahab  wrote:
>>
>> Hi,
>>
>> I am using the latest Cassandra-Spark Connector  to access Cassandra
>> tables form Spark. While I successfully managed to connect Cassandra using
>> CassandraRDD, the similar SparkSQL approach does not work. Here is my code
>> for both methods:
>>
>> import com.datastax.spark.connector._
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>>
>> import org.apache.spark.sql._;
>>
>> import org.apache.spark.SparkContext._
>>
>> import org.apache.spark.sql.catalyst.expressions._
>>
>> import com.datastax.spark.connector.cql.Cassandr

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
Hi Shahab,
The apache cassandra version looks great.

I think that doing
  cc.setKeyspace("mydb")
  cc.sql("SELECT * FROM mytable")

versus  
  cc.setKeyspace("mydb")
  cc.sql("select * from mydb.mytable ") 

Is the problem? And if not, would you mind creating a ticket off-list for us to 
help further? You can create one here:
https://github.com/datastax/spark-cassandra-connector/issues
with tag: help wanted :)
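
For completeness, a minimal end-to-end sketch of the first form (the host, keyspace and table names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

val conf = new SparkConf()
  .setAppName("CassandraSQLExample")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")

val sc = new SparkContext(conf)
val cc = new CassandraSQLContext(sc)

cc.setKeyspace("mydb")                     // pick the keyspace once
val rows = cc.sql("SELECT * FROM mytable") // then use the bare table name
rows.collect().foreach(println)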

Cheers,

- Helena
@helenaedelson

On Oct 31, 2014, at 1:59 PM, shahab  wrote:

> Thanks Helena.
> I tried setting the "KeySpace", but I got same result. I also removed other 
> Cassandra dependencies,  but still same exception!
> I also tried to see if this setting appears in the CassandraSQLContext or 
> not, so I printed out the output of configustion
> 
> val cc = new CassandraSQLContext(sc)
> 
> cc.setKeyspace("mydb")
> 
> cc.conf.getAll.foreach(f => println (f._1  + " : " +  f._2))
> 
> printout: 
> 
> spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2
> 
> spark.driver.host :192.168.1.111
> 
> spark.cassandra.connection.host : localhost
> 
> spark.cassandra.input.split.size : 1
> 
> spark.app.name : SomethingElse
> 
> spark.fileserver.uri :  http://192.168.1.111:51463
> 
> spark.driver.port : 51461
> 
> 
> spark.master :  local
> 
> 
> Does it have anything to do with the version of Apache Cassandra that I use?? 
> I use "apache-cassandra-2.1.0"
> 
> 
> best,
> /Shahab
> 
> The shortened SBT :
> "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" 
> withSources() withJavadoc(),
> 
> "net.jpountz.lz4" % "lz4" % "1.2.0",
> 
> "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" 
> exclude("org.apache.hadoop", "hadoop-core"),
> 
> "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
> 
> "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
> 
> "com.github.nscala-time" %% "nscala-time" % "1.0.0",
> 
> "org.scalatest" %% "scalatest" % "1.9.1" % "test",
> 
> "org.apache.spark" %% "spark-sql" % "1.1.0" %  "provided",
> 
> "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
> 
> "org.json4s" %% "json4s-jackson" % "3.2.5",
> 
> "junit" % "junit" % "4.8.1" % "test",
> 
> "org.slf4j" % "slf4j-api" % "1.7.7",
> 
> "org.slf4j" % "slf4j-simple" % "1.7.7",
> 
> "org.clapper" %% "grizzled-slf4j" % "1.0.2",
> 
> "log4j" % "log4j" % "1.2.17"
> 
> 
> On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson  
> wrote:
> Hi Shahab,
> 
> I’m just curious, are you explicitly needing to use thrift? Just using the 
> connector with spark does not require any thrift dependencies.
> Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1”
> 
> But to your question, you declare the keyspace but also unnecessarily repeat 
> the keyspace.table in your select.
> Try this instead:
> 
> val cc = new CassandraSQLContext(sc)
> cc.setKeyspace("keyspaceName")
> val result = cc.sql("SELECT * FROM tableName") etc
> 
> - Helena
> @helenaedelson
> 
> On Oct 31, 2014, at 1:25 PM, shahab  wrote:
> 
>> Hi,
>> 
>> I am using the latest Cassandra-Spark Connector to access Cassandra tables 
>> from Spark. While I successfully managed to connect to Cassandra using 
>> CassandraRDD, the equivalent SparkSQL approach does not work. Here is my code 
>> for both methods:
>> 
>> import com.datastax.spark.connector._
>> 
>> import org.apache.spark.{SparkConf, SparkContext}
>> 
>> import org.apache.spark.sql._;
>> 
>> import org.apache.spark.SparkContext._
>> 
>> import org.apache.spark.sql.catalyst.expressions._
>> 
>> import com.datastax.spark.connector.cql.CassandraConnector
>> 
>> import org.apache.spark.sql.cassandra.CassandraSQLContext
>> 
>> 
>> 
>>   val conf = new SparkConf().setAppName("SomethingElse")  
>> 
>>.setMaster("local")
>> 
>> .set("spark.cassandra

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Thanks Helena.
I tried setting the "KeySpace", but I got the same result. I also removed the other
Cassandra dependencies, but still got the same exception!
I also tried to see if this setting appears in the CassandraSQLContext or
not, so I printed out the configuration:

val cc = new CassandraSQLContext(sc)

cc.setKeyspace("mydb")

cc.conf.getAll.foreach(f => println (f._1  + " : " +  f._2))

printout:

spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2

spark.driver.host :192.168.1.111

spark.cassandra.connection.host : localhost

spark.cassandra.input.split.size : 1

spark.app.name : SomethingElse

spark.fileserver.uri :  http://192.168.1.111:51463

spark.driver.port : 51461

spark.master :  local

Does it have anything to do with the version of Apache Cassandra that I
use?? I use "apache-cassandra-2.1.0"


best,
/Shahab

The shortened SBT :

"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
withSources() withJavadoc(),

"net.jpountz.lz4" % "lz4" % "1.2.0",

"org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
exclude("org.apache.hadoop", "hadoop-core"),

"org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",

"org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",

"com.github.nscala-time" %% "nscala-time" % "1.0.0",

"org.scalatest" %% "scalatest" % "1.9.1" % "test",

"org.apache.spark" %% "spark-sql" % "1.1.0" %  "provided",

"org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",

"org.json4s" %% "json4s-jackson" % "3.2.5",

"junit" % "junit" % "4.8.1" % "test",

"org.slf4j" % "slf4j-api" % "1.7.7",

"org.slf4j" % "slf4j-simple" % "1.7.7",

"org.clapper" %% "grizzled-slf4j" % "1.0.2",

"log4j" % "log4j" % "1.2.17"

On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson  wrote:

> Hi Shahab,
>
> I’m just curious, do you explicitly need to use thrift? Just using the
> connector with Spark does not require any thrift dependencies.
> Simply: "com.datastax.spark" %% "spark-cassandra-connector" %
> "1.1.0-beta1"
>
> But to your question, you declare the keyspace but also unnecessarily
> repeat the keyspace.table in your select.
> Try this instead:
>
> val cc = new CassandraSQLContext(sc)
> cc.setKeyspace("keyspaceName")
> val result = cc.sql("SELECT * FROM tableName") etc
>
> - Helena
> @helenaedelson
>
> On Oct 31, 2014, at 1:25 PM, shahab  wrote:
>
> Hi,
>
> I am using the latest Cassandra-Spark Connector to access Cassandra
> tables from Spark. While I successfully managed to connect to Cassandra using
> CassandraRDD, the equivalent SparkSQL approach does not work. Here is my code
> for both methods:
>
> import com.datastax.spark.connector._
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> import org.apache.spark.sql._;
>
> import org.apache.spark.SparkContext._
>
> import org.apache.spark.sql.catalyst.expressions._
>
> import com.datastax.spark.connector.cql.CassandraConnector
>
> import org.apache.spark.sql.cassandra.CassandraSQLContext
>
>
>   val conf = new SparkConf().setAppName("SomethingElse")
>
>.setMaster("local")
>
> .set("spark.cassandra.connection.host", "localhost")
>
> val sc: SparkContext = new SparkContext(conf)
>
>  val rdd = sc.cassandraTable("mydb", "mytable")  // this works
>
> But:
>
> val cc = new CassandraSQLContext(sc)
>
>  cc.setKeyspace("mydb")
>
>  val srdd: SchemaRDD = cc.sql("select * from mydb.mytable ")
>
> println ("count : " +  srdd.count) // does not work
>
> Exception is thrown:
>
> Exception in thread "main"
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.util.NoSuchElementException: key not found: mydb3.inverseeventtype
>
> at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
>
> at com.google.common.cache.LocalCache.get(LocalCache.java:3934)
>
> at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)
>
> 
>
>
> in fact mydb3 is another keyspace, which I did not even try to connect
> to it!
>
>
>

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
Hi Shahab,

I’m just curious, do you explicitly need to use thrift? Just using the 
connector with Spark does not require any thrift dependencies.
Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"

But to your question, you declare the keyspace but also unnecessarily repeat 
the keyspace.table in your select.
Try this instead:

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("keyspaceName")
val result = cc.sql("SELECT * FROM tableName") etc

- Helena
@helenaedelson

On Oct 31, 2014, at 1:25 PM, shahab  wrote:

> Hi,
> 
> I am using the latest Cassandra-Spark Connector to access Cassandra tables 
> from Spark. While I successfully managed to connect to Cassandra using 
> CassandraRDD, the equivalent SparkSQL approach does not work. Here is my code 
> for both methods:
> 
> import com.datastax.spark.connector._
> 
> import org.apache.spark.{SparkConf, SparkContext}
> 
> import org.apache.spark.sql._;
> 
> import org.apache.spark.SparkContext._
> 
> import org.apache.spark.sql.catalyst.expressions._
> 
> import com.datastax.spark.connector.cql.CassandraConnector
> 
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> 
> 
> 
>   val conf = new SparkConf().setAppName("SomethingElse")  
> 
>.setMaster("local")
> 
> .set("spark.cassandra.connection.host", "localhost")
> 
> val sc: SparkContext = new SparkContext(conf)
> 
> 
> val rdd = sc.cassandraTable("mydb", "mytable")  // this works
> 
> But:
> 
> val cc = new CassandraSQLContext(sc)
> 
>  cc.setKeyspace("mydb")
> 
>  val srdd: SchemaRDD = cc.sql("select * from mydb.mytable ")   
> 
> 
> println ("count : " +  srdd.count) // does not work
> 
> Exception is thrown:
> 
> Exception in thread "main" 
> com.google.common.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: mydb3.inverseeventtype
> 
> at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
> 
> at com.google.common.cache.LocalCache.get(LocalCache.java:3934)
> 
> 
> at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)
> 
> 
> 
> 
> 
> in fact mydb3 is another keyspace, which I did not even try to connect to 
> it!
> 
> 
> 
> Any idea?
> 
> 
> 
> best,
> 
> /Shahab
> 
> 
> 
> Here is what my SBT looks like: 
> 
> libraryDependencies ++= Seq(
> 
> "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" 
> withSources() withJavadoc(),
> 
> "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),
> 
> "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),
> 
> "net.jpountz.lz4" % "lz4" % "1.2.0",
> 
> "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", 
> "slf4j-api") exclude("javax.servlet", "servlet-api"),
> 
> "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" 
> intransitive(),
> 
> "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" 
> exclude("org.apache.hadoop", "hadoop-core"),
> 
> "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
> 
> "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
> 
> "com.github.nscala-time" %% "nscala-time" % "1.0.0",
> 
> "org.scalatest" %% "scalatest" % "1.9.1" % "test",
> 
> "org.apache.spark" %% "spark-sql" % "1.1.0" %  "provided",
> 
> "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
> 
> "org.json4s" %% "json4s-jackson" % "3.2.5",
> 
> "junit" % "junit" % "4.8.1" % "test",
> 
> "org.slf4j" % "slf4j-api" % "1.7.7",
> 
> "org.slf4j" % "slf4j-simple" % "1.7.7",
> 
> "org.clapper" %% "grizzled-slf4j" % "1.0.2",
> 
> 
> "log4j" % "log4j" % "1.2.17")
> 



Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Hi,

I am using the latest Cassandra-Spark Connector to access Cassandra tables
from Spark. While I successfully managed to connect to Cassandra using
CassandraRDD, the equivalent SparkSQL approach does not work. Here is my code
for both methods:

import com.datastax.spark.connector._

import org.apache.spark.{SparkConf, SparkContext}

import org.apache.spark.sql._;

import org.apache.spark.SparkContext._

import org.apache.spark.sql.catalyst.expressions._

import com.datastax.spark.connector.cql.CassandraConnector

import org.apache.spark.sql.cassandra.CassandraSQLContext


  val conf = new SparkConf().setAppName("SomethingElse")

   .setMaster("local")

.set("spark.cassandra.connection.host", "localhost")

val sc: SparkContext = new SparkContext(conf)

  val rdd = sc.cassandraTable("mydb", "mytable")  // this works

But:

val cc = new CassandraSQLContext(sc)

 cc.setKeyspace("mydb")

 val srdd: SchemaRDD = cc.sql("select * from mydb.mytable ")

println ("count : " +  srdd.count) // does not work

Exception is thrown:

Exception in thread "main"
com.google.common.util.concurrent.UncheckedExecutionException:
java.util.NoSuchElementException: key not found: mydb3.inverseeventtype

at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)

at com.google.common.cache.LocalCache.get(LocalCache.java:3934)

 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)




in fact mydb3 is another keyspace, which I did not even try to connect to
it!


Any idea?


best,

/Shahab


Here is what my SBT looks like:

libraryDependencies ++= Seq(

"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
withSources() withJavadoc(),

"org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),

"org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),

"net.jpountz.lz4" % "lz4" % "1.2.0",

"org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-
api") exclude("javax.servlet", "servlet-api"),

"com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4"
intransitive(),

"org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
exclude("org.apache.hadoop", "hadoop-core"),

"org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",

"org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",

"com.github.nscala-time" %% "nscala-time" % "1.0.0",

"org.scalatest" %% "scalatest" % "1.9.1" % "test",

"org.apache.spark" %% "spark-sql" % "1.1.0" %  "provided",

"org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",

"org.json4s" %% "json4s-jackson" % "3.2.5",

"junit" % "junit" % "4.8.1" % "test",

"org.slf4j" % "slf4j-api" % "1.7.7",

"org.slf4j" % "slf4j-simple" % "1.7.7",

"org.clapper" %% "grizzled-slf4j" % "1.0.2",

"log4j" % "log4j" % "1.2.17")


Re: SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Michael Armbrust
Hmmm, this looks like a bug.  Can you file a JIRA?

On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud 
wrote:

> Hi,
>
> While testing SparkSQL on top of our Hive metastore, I am getting
> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
> table.
>
> Basically, I have a table "mtable" partitioned by some "date" field in
> hive and below is the scala code I am running in spark-shell:
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
> val rdd_mtable = sqlContext.sql("select * from mtable where
> date=20141028");
> rdd_mtable.registerTempTable("rdd_mtable");
> sqlContext.cacheTable("rdd_mtable");
> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- OK
> sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- Exception
>
> So the first collect() is working just fine; however, running the second
> collect(), which I expect to use the cached RDD, throws a
> java.lang.ArrayIndexOutOfBoundsException; see the backtrace at the end of
> this email. It seems the columnar traversal is crashing for some reason.
> FYI, I am using spark ToT (234de9232bcfa212317a8073c4a82c3863b36b14).
>
> java.lang.ArrayIndexOutOfBoundsException: 14
> at
> org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
> at
> org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
> at
> org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
> at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:89)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73)
> at
> org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:122)
> at
> org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:119)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
> at
> org.apache.spark.sql.CacheManager$class.useCachedData(CacheManager.scala:119)
> at org.apache.spark.sql.SQLContext.useCachedData(SQLContext.scala:49)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute

SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Jean-Pascal Billaud
Hi,

While testing SparkSQL on top of our Hive metastore, I am getting
some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
table.

Basically, I have a table "mtable" partitioned by some "date" field in hive
and below is the scala code I am running in spark-shell:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val rdd_mtable = sqlContext.sql("select * from mtable where date=20141028");
rdd_mtable.registerTempTable("rdd_mtable");
sqlContext.cacheTable("rdd_mtable");
sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- OK
sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- Exception

So the first collect() is working just fine; however, running the second
collect(), which I expect to use the cached RDD, throws a
java.lang.ArrayIndexOutOfBoundsException; see the backtrace at the end of
this email. It seems the columnar traversal is crashing for some reason.
FYI, I am using spark ToT (234de9232bcfa212317a8073c4a82c3863b36b14).

java.lang.ArrayIndexOutOfBoundsException: 14
at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
at
org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
at
org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:89)
at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66)
at
org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87)
at
org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73)
at
org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:122)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:119)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at
org.apache.spark.sql.CacheManager$class.useCachedData(CacheManager.scala:119)
at org.apache.spark.sql.SQLContext.useCachedData(SQLContext.scala:49)
at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:376)
at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:376)
at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:377)
at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:377)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:382)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:380)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:386)
at

Re: SparkSQL: Nested Query error

2014-10-29 Thread SK
Hi,

 I am getting an error in the Query Plan when I use the SQL statement
exactly as you have suggested. Is that the exact SQL statement I should be
using (I am not very familiar with SQL syntax)?


I also tried using the SchemaRDD's subtract method to perform this query.
usersRDD.subtract(deviceRDD).count(). The count comes out to be  1, but
there are many UIDs in "tusers" that are not in "device" - so the result is
not correct. 

I would like to know the right way to frame this query in SparkSQL.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-Query-error-tp17691p17705.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
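
For readers hitting the same issue: since the 1.1.0 SQL parser rejects the NOT IN
subquery above, one workaround is to skip the subquery entirely, project just the
ID columns, and do the set difference at the RDD level (subtracting whole SchemaRDD
rows, as tried above, compares entire rows rather than UIDs). A sketch only,
assuming "tusers" and "device" are registered tables on the sql_cxt context from
this thread:

// Pull out only the UID column from each table so subtract compares UIDs alone.
val userIds   = sql_cxt.sql("SELECT u_uid FROM tusers").map(_.getString(0))
val deviceIds = sql_cxt.sql("SELECT d_uid FROM device").map(_.getString(0))
// Users that have no matching device row.
val usersWithNoDevice = userIds.distinct().subtract(deviceIds.distinct()).count()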



Re: SparkSQL: Nested Query error

2014-10-29 Thread Sanjiv Mittal
You may use -

select count(u_uid) from tusers a left outer join device b on (a.u_uid =
 b.d_uid) where b.d_uid is null

On Wed, Oct 29, 2014 at 5:45 PM, SK  wrote:

> Hi,
>
> I am using Spark 1.1.0. I have the following SQL statement where I am
> trying
> to count the number of UIDs that are in the "tusers" table but not in the
> "device" table.
>
> val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers
> WHERE tusers.u_uid NOT IN (SELECT d_uid FROM device)")
>
> I am getting the following error:
> Exception in thread "main" java.lang.RuntimeException: [1.61] failure:
> string literal expected
> SELECT COUNT (u_uid) FROM tusers WHERE tusers.u_uid NOT IN (SELECT d_uid
> FROM device)
>
> I am not sure if every subquery has to be a string, so I tried to enclose
> the subquery as a  string literal as follows:
> val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers
> WHERE tusers.u_uid NOT IN ("SELECT d_uid FROM device")")
> But that resulted in a compilation error.
>
> What is the right way to frame the above query in Spark SQL?
>
> thanks
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-Query-error-tp17691.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


SparkSQL: Nested Query error

2014-10-29 Thread SK
Hi,

I am using Spark 1.1.0. I have the following SQL statement where I am trying
to count the number of UIDs that are in the "tusers" table but not in the
"device" table.

val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers
WHERE tusers.u_uid NOT IN (SELECT d_uid FROM device)")

I am getting the following error:
Exception in thread "main" java.lang.RuntimeException: [1.61] failure:
string literal expected
SELECT COUNT (u_uid) FROM tusers WHERE tusers.u_uid NOT IN (SELECT d_uid
FROM device)

I am not sure if every subquery has to be a string, so I tried to enclose
the subquery as a  string literal as follows: 
val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers
WHERE tusers.u_uid NOT IN ("SELECT d_uid FROM device")")
But that resulted in a compilation error.

What is the right way to frame the above query in Spark SQL?

thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-Query-error-tp17691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD to Multiple Tables SparkSQL

2014-10-28 Thread critikaled
Hi Oliver,
Thanks for the answer. I don't have the information about all keys beforehand.
The reason I want multiple tables is that, based on my knowledge of the known
keys, I will apply different queries to get the results for each particular
key; I don't want to touch the unknown ones and will save those for later. My
idea is that if every key has an individual table, querying will be faster.
With a single table I would do "select id from kv where key
= 'some_key' and value operator 'some_thing'". I want to make it "select id
from some_key where value operator 'some_thing'". BTW, what do you mean by
"extract"? Could you direct me to an API or code sample?

thanks and regards,
critikaled.  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807p17536.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
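
For reference, the per-key-table idea above can be sketched by filtering the single
registered table once per known key and registering/caching each result as its own
temp table, leaving unknown keys untouched in the original table. The table "kv" and
columns "key", "value", "id" follow the names used in this thread; the knownKeys list
is purely illustrative, and sqlContext is assumed to be the context the "kv" table
was registered with:

// One cached temp table per known key; rows for unknown keys stay only in "kv".
// Assumes each key string is also a legal table name.
val knownKeys = Seq("some_key", "another_key")
knownKeys.foreach { k =>
  sqlContext.sql(s"SELECT id, value FROM kv WHERE key = '$k'").registerTempTable(k)
  sqlContext.cacheTable(k)
}
// Querying a single known key then becomes, for example:
//   sqlContext.sql("SELECT id FROM some_key WHERE value > 42")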



Re: Re: SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
It works, thanks very much



Zhanfeng Huo
 
From: Yanbo Liang
Date: 2014-10-28 18:50
To: Zhanfeng Huo
CC: user
Subject: Re: SparkSql OutOfMemoryError
Try to increase the driver memory.

2014-10-28 17:33 GMT+08:00 Zhanfeng Huo :
Hi, friends:

I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails when 
the data is large. How should I tune it?

spark-defaults.conf:

spark.shuffle.consolidateFiles true 
spark.shuffle.manager SORT
spark.akka.threads 4
spark.sql.inMemoryColumnarStorage.compressed true
spark.io.compression.codec lz4

cmds:

./spark-sql cluster --master spark://master:7077 --conf 
spark.akka.frameSize=2000 --executor-memory 30g --total-executor-cores 300 
--driver-class-path /spark/default/lib/mysql-connector-java-5.1.21.jar

cmds in command line:

SET spark.serializer=org.apache.spark.serializer.KryoSerializer; 
SET spark.sql.shuffle.partitions=1;
select sum(uv) as suv, sum(vv) as svv from (select 1 as uv, count(1) as vv 
from test ta  group by cookieid) t1;




14/10/28 14:42:52 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem 
[sparkDriver] 
java.lang.OutOfMemoryError: Java heap space 
at com.google.protobuf_spark.ByteString.copyFrom(ByteString.java:90) 
at com.google.protobuf_spark.ByteString.copyFrom(ByteString.java:99) 
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36) 
at 
akka.remote.EndpointWriter$$anonfun$akka$remote$EndpointWriter$$serializeMessage$1.apply(Endpoint.scala:672)
 
at 
akka.remote.EndpointWriter$$anonfun$akka$remote$EndpointWriter$$serializeMessage$1.apply(Endpoint.scala:672)
 
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) 
at 
akka.remote.EndpointWriter.akka$remote$EndpointWriter$$serializeMessage(Endpoint.scala:671)
 
at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:559) 
at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:544) 
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 
at akka.actor.FSM$class.processEvent(FSM.scala:595) 
at akka.remote.EndpointWriter.processEvent(Endpoint.scala:443) 
at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589) 
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583) 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
14/10/28 14:42:55 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-36] shutting down ActorSystem 
[sparkDriver] 
java.lang.OutOfMemoryError: Java heap space


Zhanfeng Huo



Re: SparkSql OutOfMemoryError

2014-10-28 Thread Yanbo Liang
Try to increase the driver memory.

2014-10-28 17:33 GMT+08:00 Zhanfeng Huo :

> Hi, friends:
>
> I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails
> when the data is large. How should I tune it?
>
> spark-defaults.conf:
>
> spark.shuffle.consolidateFiles true
> spark.shuffle.manager SORT
> spark.akka.threads 4
> spark.sql.inMemoryColumnarStorage.compressed true
> spark.io.compression.codec lz4
>
> cmds:
>
> ./spark-sql cluster --master spark://master:7077 --conf
> spark.akka.frameSize=2000 --executor-memory 30g --total-executor-cores 300
> --driver-class-path /spark/default/lib/mysql-connector-java-5.1.21.jar
>
> cmds in command line:
>
> SET spark.serializer=org.apache.spark.serializer.KryoSerializer;
> SET spark.sql.shuffle.partitions=1;
> select sum(uv) as suv, sum(vv) as svv from (select 1 as uv, count(1)
> as vv from test ta  group by cookieid) t1;
>
>
>
>
> 14/10/28 14:42:52 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem
> [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
> at com.google.protobuf_spark.ByteString.copyFrom(ByteString.java:90)
> at com.google.protobuf_spark.ByteString.copyFrom(ByteString.java:99)
> at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
> at
> akka.remote.EndpointWriter$$anonfun$akka$remote$EndpointWriter$$serializeMessage$1.apply(Endpoint.scala:672)
>
> at
> akka.remote.EndpointWriter$$anonfun$akka$remote$EndpointWriter$$serializeMessage$1.apply(Endpoint.scala:672)
>
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> akka.remote.EndpointWriter.akka$remote$EndpointWriter$$serializeMessage(Endpoint.scala:671)
>
> at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:559)
> at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:544)
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>
> at akka.actor.FSM$class.processEvent(FSM.scala:595)
> at akka.remote.EndpointWriter.processEvent(Endpoint.scala:443)
> at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589)
> at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 14/10/28 14:42:55 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkDriver-akka.actor.default-dispatcher-36] shutting down ActorSystem
> [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
> --
> Zhanfeng Huo
>


SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
Hi, friends:

I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails when 
the data is large. How should I tune it?

spark-defaults.conf:

spark.shuffle.consolidateFiles true 
spark.shuffle.manager SORT
spark.akka.threads 4
spark.sql.inMemoryColumnarStorage.compressed true
spark.io.compression.codec lz4

cmds:

./spark-sql cluster --master spark://master:7077 --conf 
spark.akka.frameSize=2000 --executor-memory 30g --total-executor-cores 300 
--driver-class-path /spark/default/lib/mysql-connector-java-5.1.21.jar

cmds in command line:

SET spark.serializer=org.apache.spark.serializer.KryoSerializer; 
SET spark.sql.shuffle.partitions=1;
select sum(uv) as suv, sum(vv) as svv from (select 1 as uv, count(1) as vv 
from test ta  group by cookieid) t1;




14/10/28 14:42:52 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem 
[sparkDriver] 
java.lang.OutOfMemoryError: Java heap space 
at com.google.protobuf_spark.ByteString.copyFrom(ByteString.java:90) 
at com.google.protobuf_spark.ByteString.copyFrom(ByteString.java:99) 
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36) 
at 
akka.remote.EndpointWriter$$anonfun$akka$remote$EndpointWriter$$serializeMessage$1.apply(Endpoint.scala:672)
 
at 
akka.remote.EndpointWriter$$anonfun$akka$remote$EndpointWriter$$serializeMessage$1.apply(Endpoint.scala:672)
 
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) 
at 
akka.remote.EndpointWriter.akka$remote$EndpointWriter$$serializeMessage(Endpoint.scala:671)
 
at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:559) 
at akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:544) 
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 
at akka.actor.FSM$class.processEvent(FSM.scala:595) 
at akka.remote.EndpointWriter.processEvent(Endpoint.scala:443) 
at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589) 
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583) 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
14/10/28 14:42:55 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.actor.default-dispatcher-36] shutting down ActorSystem 
[sparkDriver] 
java.lang.OutOfMemoryError: Java heap space


Zhanfeng Huo


Re: 答复: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Hm, these DDLs look quite normal. I forgot to ask: what version of Spark 
SQL are you using? Can you reproduce this issue with the master branch?


Also, a small sample of input data that can reproduce this issue would be 
very helpful. For example, you can run SELECT * FROM tbl TABLESAMPLE (10 
ROWS) against all tables involved in your query.


On 10/27/14 4:51 PM, lyf刘钰帆 wrote:


Yes. All data are located in HDFS.

CREATE TABLE tblDate(

dateID string,

theyearmonth string,

theyear string,

themonth string,

thedate string,

theweek string,

theweeks string,

thequot string,

thetenday string,

thehalfmonth string

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;

CREATE TABLE tblA(

number STRING,

locationid STRING,

dateID string

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;

CREATE TABLE tblB (

number STRING,

rownum int,

itemid STRING,

qty INT,

price int,

amount int

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;

LOAD DATA LOCAL INPATH '/home/data/testFolder/qryTheDate.txt' INTO 
TABLE tblDate;


LOAD DATA LOCAL INPATH '/home/data/testFolder/qrytblA.txt' INTO TABLE 
tblA;


LOAD DATA LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE 
tblB;


*From:* Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* October 27, 2014 16:48
*To:* lyf刘钰帆; user@spark.apache.org
*Subject:* Re: SparkSQL display wrong result

Would you mind sharing the DDLs of all involved tables? What format are 
these tables stored in? Is this issue specific to this query? I guess 
Hive, Shark and Spark SQL all read from the same HDFS dataset?


On 10/27/14 3:45 PM, lyf刘钰帆 wrote:

Hi,

I have been using SparkSQL 1.1.0 with CDH 4.6.0 recently; however,
SparkSQL prints wrong results compared to Hive (Shark).

The SQL:

select c.theyear, sum(b.amount) from tblA a join tblB b on
a.number = b.number join tbldate c on a.dateid = c.dateid group by
c.theyear order by theyear;

*where *

dateID string,

theyear string,

amount int

number STRING,

Result by Shark:

+--+---+

| theyear  |_c1 |

+--+---+

| 2004 | *1403018* |

| 2005 | 5557850 |

| 2006 | 7203061 |

| 2007 | 11300432 |

| 2008 | 12109328 |

| 2009 | 5365447 |

| 2010 | 188944 |

+--+---+

Result by Hive:

theyear_c1

2004 *1403018*

2005 5557850

2006 7203061

2007 11300432

    2008     12109328

2009 5365447

2010 188944

Result by SparkSQL:

+--+---+

| theyear  |c_1 |

+--+---+

| 2004 | *3265696* |

| 2005 | 13247234 |

| 2006 | 13670416 |

| 2007 | 16711974 |

| 2008 | 14670698 |

| 2009 | 6322137 |

| 2010 | 210924 |

+--+---+

Best regards

Patrick Liu 刘钰帆 | 1#3F122





Re: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Would you mind sharing the DDLs of all involved tables? What format are 
these tables stored in? Is this issue specific to this query? I guess 
Hive, Shark and Spark SQL all read from the same HDFS dataset?



On 10/27/14 3:45 PM, lyf刘钰帆 wrote:


Hi,

I have been using SparkSQL 1.1.0 with CDH 4.6.0 recently; however, 
SparkSQL prints wrong results compared to Hive (Shark).


The SQL:

select c.theyear, sum(b.amount) from tblA a join tblB b on a.number = 
b.number join tbldate c on a.dateid = c.dateid group by c.theyear 
order by theyear;


*where *

dateID string,

theyear string,

amount int

number STRING,

Result by Shark:

+--+---+

| theyear  |_c1|

+--+---+

| 2004 | *1403018* |

| 2005 | 5557850   |

| 2006 | 7203061   |

| 2007 | 11300432  |

| 2008 | 12109328  |

| 2009 | 5365447   |

| 2010 | 188944|

+--+---+

Result by Hive:

theyear_c1

2004 *1403018*

2005 5557850

2006 7203061

2007 11300432

2008 12109328

2009 5365447

2010 188944

Result by SparkSQL:

+--+---+

| theyear  |c_1|

+--+---+

| 2004 | *3265696* |

| 2005 | 13247234  |

| 2006 | 13670416  |

| 2007 | 16711974  |

| 2008 | 14670698  |

| 2009 | 6322137   |

| 2010 | 210924|

+--+---+

Best regards

Patrick Liu 刘钰帆 | 1#3F122





Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the
JDBC server from within your own programs
<https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45>
by passing it the HiveContext.

On Fri, Oct 24, 2014 at 2:07 PM, ankits  wrote:

> Thanks for your response Michael.
>
> I'm still not clear on all the details - in particular, how do I take a
> temp
> table created from a SchemaRDD and allow it to be queried using the Thrift
> JDBC server? From the Hive guides, it looks like it only supports loading
> data from files, but I want to query tables stored in memory only via JDBC.
> Is that possible?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196p17235.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
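
For reference, a minimal sketch of the in-process approach Michael describes. It
relies on the HiveThriftServer2.startWithContext entry point he links to, which at
the time of this thread exists only on master (slated for 1.2), so treat the exact
call as an assumption; the table names "allData" and "newData" are reused from the
thread, and the date literal is arbitrary:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext

// Build an in-memory temp table from the processed data and cache it.
val recent = hiveContext.sql("SELECT * FROM allData WHERE date > '2014-10-01'")
recent.registerTempTable("newData")
hiveContext.cacheTable("newData")

// Start the Thrift/JDBC server against this same context; JDBC clients that
// connect to it can then query the cached "newData" temp table.
HiveThriftServer2.startWithContext(hiveContext)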


Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
Thanks for your response Michael.

I'm still not clear on all the details - in particular, how do I take a temp
table created from a SchemaRDD and allow it to be queried using the Thrift
JDBC server? From the Hive guides, it looks like it only supports loading
data from files, but I want to query tables stored in memory only via JDBC.
Is that possible?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196p17235.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
That works perfect. Thanks again Michael

On Fri, Oct 24, 2014 at 3:10 PM, Michael Armbrust 
wrote:

> It won't be transparent, but you can do so something like:
>
> CACHE TABLE newData AS SELECT * FROM allData WHERE date > "..."
>
> and then query newData.
>
> On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood 
> wrote:
>
>> Is there a way to cache certain (or most latest) partitions of certain
>> tables ?
>>
>> On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust > > wrote:
>>
>>> It does have support for caching using either CACHE TABLE  or
>>> CACHE TABLE  AS SELECT 
>>>
>>> On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:
>>>
>>>> I want to set up spark SQL to allow ad hoc querying over the last X
>>>> days of
>>>> processed data, where the data is processed through spark. This would
>>>> also
>>>> have to cache data (in memory only), so the approach I was thinking of
>>>> was
>>>> to build a layer that persists the appropriate RDDs and stores them in
>>>> memory.
>>>>
>>>> I see spark sql allows ad hoc querying through JDBC though I have never
>>>> used
>>>> that before. Will using JDBC offer any advantages (e.g does it have
>>>> built in
>>>> support for caching?) over rolling my own solution for this use case?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It won't be transparent, but you can do so something like:

CACHE TABLE newData AS SELECT * FROM allData WHERE date > "..."

and then query newData.

On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood  wrote:

> Is there a way to cache certain (or most latest) partitions of certain
> tables ?
>
> On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust 
> wrote:
>
>> It does have support for caching using either CACHE TABLE  or
>> CACHE TABLE  AS SELECT 
>>
>> On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:
>>
>>> I want to set up spark SQL to allow ad hoc querying over the last X days
>>> of
>>> processed data, where the data is processed through spark. This would
>>> also
>>> have to cache data (in memory only), so the approach I was thinking of
>>> was
>>> to build a layer that persists the appropriate RDDs and stores them in
>>> memory.
>>>
>>> I see spark sql allows ad hoc querying through JDBC though I have never
>>> used
>>> that before. Will using JDBC offer any advantages (e.g does it have
>>> built in
>>> support for caching?) over rolling my own solution for this use case?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
Is there a way to cache certain (or most latest) partitions of certain
tables ?

On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust 
wrote:

> It does have support for caching using either CACHE TABLE  or
> CACHE TABLE  AS SELECT 
>
> On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:
>
>> I want to set up spark SQL to allow ad hoc querying over the last X days
>> of
>> processed data, where the data is processed through spark. This would also
>> have to cache data (in memory only), so the approach I was thinking of was
>> to build a layer that persists the appropriate RDDs and stores them in
>> memory.
>>
>> I see spark sql allows ad hoc querying through JDBC though I have never
>> used
>> that before. Will using JDBC offer any advantages (e.g does it have built
>> in
>> support for caching?) over rolling my own solution for this use case?
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It does have support for caching using either CACHE TABLE  or
CACHE TABLE  AS SELECT 

On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:

> I want to set up spark SQL to allow ad hoc querying over the last X days of
> processed data, where the data is processed through spark. This would also
> have to cache data (in memory only), so the approach I was thinking of was
> to build a layer that persists the appropriate RDDs and stores them in
> memory.
>
> I see spark sql allows ad hoc querying through JDBC though I have never
> used
> that before. Will using JDBC offer any advantages (e.g does it have built
> in
> support for caching?) over rolling my own solution for this use case?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar
Just curious... Why would you not store the processed results in a regular
relational database? I'm not sure what you meant by "persist the appropriate
RDDs". Did you mean the output of your job will be RDDs?

On 24 October 2014 13:35, ankits  wrote:

> I want to set up spark SQL to allow ad hoc querying over the last X days of
> processed data, where the data is processed through spark. This would also
> have to cache data (in memory only), so the approach I was thinking of was
> to build a layer that persists the appropriate RDDs and stores them in
> memory.
>
> I see spark sql allows ad hoc querying through JDBC though I have never
> used
> that before. Will using JDBC offer any advantages (e.g does it have built
> in
> support for caching?) over rolling my own solution for this use case?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
I want to set up spark SQL to allow ad hoc querying over the last X days of
processed data, where the data is processed through spark. This would also
have to cache data (in memory only), so the approach I was thinking of was
to build a layer that persists the appropriate RDDs and stores them in
memory.

I see spark sql allows ad hoc querying through JDBC though I have never used
that before. Will using JDBC offer any advantages (e.g does it have built in
support for caching?) over rolling my own solution for this use case?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL , best way to divide data into partitions?

2014-10-23 Thread Michael Armbrust
Spark SQL now supports Hive style dynamic partitioning:
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions

This is a new feature so you'll have to build master or wait for 1.2.

On Wed, Oct 22, 2014 at 7:03 PM, raymond  wrote:

> Hi
>
> I have a json file that can be loaded by sqlContext.jsonFile into a
> table, but this table is not partitioned.
>
> I then wish to transform this table into a table partitioned on a field
> such as "date". What would be the best approach to do this? It seems that
> in Hive this is usually done by loading data into a dedicated partition
> directly, but I don't want to select the data out by each specific partition
> value and insert it partition by partition. How should I do it in a
> quick way? And how do I do it in Spark SQL?
>
> raymond
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
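
For reference, a rough sketch of the dynamic-partitioning route Michael mentions,
driven from a HiveContext (requires a master build or Spark 1.2+). The JSON path and
the table names "events_raw" and "events_by_date" are illustrative rather than from
the thread; the partition column "date" follows the thread, and the two SET commands
are the standard Hive dynamic-partition settings:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)   // sc: an existing SparkContext

// Register the JSON data as a (non-partitioned) temp table first.
hc.jsonFile("/path/to/events.json").registerTempTable("events_raw")

// Standard Hive settings that let partition values come from the data itself.
hc.sql("SET hive.exec.dynamic.partition = true")
hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// Partitioned target table; the partition column is declared separately.
hc.sql("CREATE TABLE events_by_date (id STRING, payload STRING) PARTITIONED BY (date STRING)")

// Dynamic-partition insert: the partition column goes last in the SELECT, and
// each row lands in the partition that matches its own date value.
hc.sql("INSERT OVERWRITE TABLE events_by_date PARTITION (date) " +
       "SELECT id, payload, date FROM events_raw")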
>


SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
Hi guys,

another question: what’s the approach to working with column-oriented data, 
i.e. data with more than 1000 columns. Using Parquet for this should be fine, 
but how well does SparkSQL handle such a large number of columns? Is there a limit? 
Should we use standard Spark instead?

Thanks for any insights,
- Marius


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL , best way to divide data into partitions?

2014-10-22 Thread raymond
Hi

I have a json file that can be loaded by sqlContext.jsonFile into a 
table, but this table is not partitioned.

I then wish to transform this table into a table partitioned on a field such as 
"date". What would be the best approach to do this? It seems that in Hive this is 
usually done by loading data into a dedicated partition directly, but I don't want 
to select the data out by each specific partition value and insert it partition by 
partition. How should I do it in a quick way? And how do I do it in 
Spark SQL?

raymond
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Ok thanks Michael.

In general, what's the easy way to figure out what's already implemented?

The exception I was getting was not really helpful here.

Also, is there a roadmap document somewhere ?

Thanks!

P.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909p16942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-21 Thread Terry Siu
Just to follow up, the queries worked against master and I got my whole flow 
rolling. Thanks for the suggestion! Now if only Spark 1.2 would come out with 
the next release of CDH5 :P

-Terry

From: Terry Siu <terry@smartfocus.com>
Date: Monday, October 20, 2014 at 12:22 PM
To: Michael Armbrust <mich...@databricks.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes

Hi Michael,

Thanks again for the reply. Was hoping it was something I was doing wrong in 
1.1.0, but I’ll try master.

Thanks,
-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 20, 2014 at 12:11 PM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes

Have you tried this on master?  There were several problems with resolution of 
complex queries that were registered as tables in the 1.1.0 release.

On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu 
<terry@smartfocus.com> wrote:
Hi all,

I’m getting a TreeNodeException for unresolved attributes when I do a simple 
select from a schemaRDD generated by a join in Spark 1.1.0. A little background 
first. I am using a HiveContext (against Hive 0.12) to grab two tables, join 
them, and then perform multiple INSERT-SELECT with GROUP BY to write back out 
to a Hive rollup table that has two partitions. This task is an effort to 
simulate the unsupported GROUPING SETS functionality in SparkSQL.

In my first attempt, I got really close using SchemaRDD.groupBy until I 
realized that the SchemaRDD.insertTo API does not support partitioned tables yet. 
This prompted my second attempt to pass in SQL to the HiveContext.sql API 
instead.

Here’s a rundown of the commands I executed on the spark-shell:


val hc = new HiveContext(sc)

hc.setConf("spark.sql.hive.convertMetastoreParquet", "true”)

hc.setConf("spark.sql.parquet.compression.codec", "snappy”)


// For implicit conversions to Expression

val sqlContext = new SQLContext(sc)

import sqlContext._


val segCusts = hc.hql("select …")

val segTxns = hc.hql("select …")


val sc = segCusts.as('sc)

val st = segTxns.as('st)


// Join the segCusts and segTxns tables

val rup = sc.join(st, Inner, 
Some("sc.segcustomerid".attr==="st.customerid".attr))

rup.registerAsTable("rupbrand")



If I do a printSchema on the rup, I get:

root

 |-- segcustomerid: string (nullable = true)

 |-- sales: double (nullable = false)

 |-- tx_count: long (nullable = false)

 |-- storeid: string (nullable = true)

 |-- transdate: long (nullable = true)

 |-- transdate_ts: string (nullable = true)

 |-- transdate_dt: string (nullable = true)

 |-- unitprice: double (nullable = true)

 |-- translineitem: string (nullable = true)

 |-- offerid: string (nullable = true)

 |-- customerid: string (nullable = true)

 |-- customerkey: string (nullable = true)

 |-- sku: string (nullable = true)

 |-- quantity: double (nullable = true)

 |-- returnquantity: double (nullable = true)

 |-- channel: string (nullable = true)

 |-- unitcost: double (nullable = true)

 |-- transid: string (nullable = true)

 |-- productid: string (nullable = true)

 |-- id: string (nullable = true)

 |-- campaign_campaigncost: double (nullable = true)

 |-- campaign_begindate: long (nullable = true)

 |-- campaign_begindate_ts: string (nullable = true)

 |-- campaign_begindate_dt: string (nullable = true)

 |-- campaign_enddate: long (nullable = true)

 |-- campaign_enddate_ts: string (nullable = true)

 |-- campaign_enddate_dt: string (nullable = true)

 |-- campaign_campaigntitle: string (nullable = true)

 |-- campaign_campaignname: string (nullable = true)

 |-- campaign_id: string (nullable = true)

 |-- product_categoryid: string (nullable = true)

 |-- product_company: string (nullable = true)

 |-- product_brandname: string (nullable = true)

 |-- product_vendorid: string (nullable = true)

 |-- product_color: string (nullable = true)

 |-- product_brandid: string (nullable = true)

 |-- product_description: string (nullable = true)

 |-- product_size: string (nullable = true)

 |-- product_subcategoryid: string (nullable = true)

 |-- product_departmentid: string (nullable = true)

 |-- product_productname: string (nullable = true)

 |-- product_categoryname: string (nullable = true)

 |-- product_vendorname: string (nullable = true)

 |-- product_sku: string (nullable = true)

 |-- product_subcategoryname: string (nullable = true)

 |-- product_status: string (nullable = true)

 |-- product_departmentname: string (nullable = true)

 |-- product_style: string (nullable = true)

 |-- product_id: string (nullable = true)

 |-- customer_lastname: string (nullable = true)

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Michael Armbrust
No, analytic and window functions do not work yet.
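
Until they do, a common fallback in 1.1.0 is to compute the ranking with plain RDD
operations from the spark-shell. A minimal sketch, using a hypothetical case class
and data, and giving ties distinct numbers (ROW_NUMBER-style rather than true RANK):

case class Sale(group: String, amount: Double)

val sales = sc.parallelize(Seq(
  Sale("a", 10.0), Sale("a", 30.0), Sale("a", 20.0),
  Sale("b", 5.0), Sale("b", 15.0)))

// For each group, sort by amount descending and number the rows starting at 1.
val ranked = sales.groupBy(_.group).flatMap { case (g, rows) =>
  rows.toSeq.sortBy(-_.amount).zipWithIndex.map {
    case (row, idx) => (g, row.amount, idx + 1)
  }
}

ranked.collect().foreach(println)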

On Tue, Oct 21, 2014 at 3:00 AM, Pierre B <
pierre.borckm...@realimpactanalytics.com> wrote:

> Hi!
>
> The RANK function is available in hive since version 0.11.
> When trying to use it in SparkSQL, I'm getting the following exception
> (full
> stacktrace below):
> java.lang.ClassCastException:
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
> cast to
>
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
>
> Is this function supposed to be available?
>
> Thanks
>
> P.
>
> ---
>
>
> java.lang.ClassCastException:
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
> cast to
>
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
> at
> org.apache.spark.sql.hive.HiveUdafFunction.(hiveUdfs.scala:334)
> at
> org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233)
> at
> org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207)
> at
> org.apache.spark.sql.execution.Aggregate.org
> $apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97)
> at
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
> at
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Hi!

The RANK function is available in hive since version 0.11.
When trying to use it in SparkSQL, I'm getting the following exception (full
stacktrace below):
java.lang.ClassCastException:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
cast to
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer

Is this function supposed to be available?

Thanks

P.

---


java.lang.ClassCastException:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
cast to
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
at org.apache.spark.sql.hive.HiveUdafFunction.(hiveUdfs.scala:334)
at
org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233)
at
org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207)
at
org.apache.spark.sql.execution.Aggregate.org$apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD to Multiple Tables SparkSQL

2014-10-21 Thread Olivier Girardot
If you already know your keys, the best way would be to "extract"
one RDD per key (it would not bring the content back to the master, and you
can take advantage of the caching features) and then call registerTempTable
for each key; a minimal sketch follows.
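
A rough sketch of that pattern, assuming the keys are known up front (the key values
and the input path are hypothetical):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class KV(key: String, id: String, value: String)

val logs = sc.textFile("logs").map { line =>
  val Array(key, id, value) = line split ' '
  KV(key, id, value)
}.cache()

val knownKeys = Seq("keyA", "keyB")
knownKeys.foreach { k =>
  // One filtered RDD per key, registered as its own temp table.
  logs.filter(_.key == k).registerTempTable(s"kvs_$k")
}

sqlContext.sql("select id, value from kvs_keyA limit 10").collect()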

But I'm guessing you don't know the keys in advance, and in that case I
think putting everything into different tables becomes very confusing.
First of all, how would you query it afterwards?

Regards,

Olivier.

2014-10-20 13:02 GMT+02:00 critikaled :

> Hi I have a rdd which I want to register as multiple tables based on key
>
> 
> val context = new SparkContext(conf)
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(context)
> import sqlContext.createSchemaRDD
>
> case class KV(key:String,id:String,value:String)
> val logsRDD = context.textFile("logs", 10).map{line=>
>   val Array(key,id,value) = line split ' '
>   (key,id,value)
> }.registerTempTable("KVS")
>
> I want to store the above information to multiple tables based on key
> without bringing the entire data to master
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi Michael,

Thanks again for the reply. Was hoping it was something I was doing wrong in 
1.1.0, but I’ll try master.

Thanks,
-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 20, 2014 at 12:11 PM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes

Have you tried this on master?  There were several problems with resolution of 
complex queries that were registered as tables in the 1.1.0 release.

On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu <terry@smartfocus.com> wrote:
Hi all,

I’m getting a TreeNodeException for unresolved attributes when I do a simple 
select from a schemaRDD generated by a join in Spark 1.1.0. A little background 
first. I am using a HiveContext (against Hive 0.12) to grab two tables, join 
them, and then perform multiple INSERT-SELECT with GROUP BY to write back out 
to a Hive rollup table that has two partitions. This task is an effort to 
simulate the unsupported GROUPING SETS functionality in SparkSQL.

In my first attempt, I got really close using  SchemaRDD.groupBy until I 
realized that SchemaRDD.insertTo API does not support partitioned tables yet. 
This prompted my second attempt to pass in SQL to the HiveContext.sql API 
instead.

Here’s a rundown of the commands I executed on the spark-shell:


val hc = new HiveContext(sc)

hc.setConf("spark.sql.hive.convertMetastoreParquet", "true”)

hc.setConf("spark.sql.parquet.compression.codec", "snappy”)


// For implicit conversions to Expression

val sqlContext = new SQLContext(sc)

import sqlContext._


val segCusts = hc.hql("select …")

val segTxns = hc.hql("select …")


val sc = segCusts.as('sc)

val st = segTxns.as('st)


// Join the segCusts and segTxns tables

val rup = sc.join(st, Inner, 
Some("sc.segcustomerid".attr==="st.customerid".attr))

rup.registerAsTable("rupbrand")



If I do a printSchema on the rup, I get:

root

 |-- segcustomerid: string (nullable = true)

 |-- sales: double (nullable = false)

 |-- tx_count: long (nullable = false)

 |-- storeid: string (nullable = true)

 |-- transdate: long (nullable = true)

 |-- transdate_ts: string (nullable = true)

 |-- transdate_dt: string (nullable = true)

 |-- unitprice: double (nullable = true)

 |-- translineitem: string (nullable = true)

 |-- offerid: string (nullable = true)

 |-- customerid: string (nullable = true)

 |-- customerkey: string (nullable = true)

 |-- sku: string (nullable = true)

 |-- quantity: double (nullable = true)

 |-- returnquantity: double (nullable = true)

 |-- channel: string (nullable = true)

 |-- unitcost: double (nullable = true)

 |-- transid: string (nullable = true)

 |-- productid: string (nullable = true)

 |-- id: string (nullable = true)

 |-- campaign_campaigncost: double (nullable = true)

 |-- campaign_begindate: long (nullable = true)

 |-- campaign_begindate_ts: string (nullable = true)

 |-- campaign_begindate_dt: string (nullable = true)

 |-- campaign_enddate: long (nullable = true)

 |-- campaign_enddate_ts: string (nullable = true)

 |-- campaign_enddate_dt: string (nullable = true)

 |-- campaign_campaigntitle: string (nullable = true)

 |-- campaign_campaignname: string (nullable = true)

 |-- campaign_id: string (nullable = true)

 |-- product_categoryid: string (nullable = true)

 |-- product_company: string (nullable = true)

 |-- product_brandname: string (nullable = true)

 |-- product_vendorid: string (nullable = true)

 |-- product_color: string (nullable = true)

 |-- product_brandid: string (nullable = true)

 |-- product_description: string (nullable = true)

 |-- product_size: string (nullable = true)

 |-- product_subcategoryid: string (nullable = true)

 |-- product_departmentid: string (nullable = true)

 |-- product_productname: string (nullable = true)

 |-- product_categoryname: string (nullable = true)

 |-- product_vendorname: string (nullable = true)

 |-- product_sku: string (nullable = true)

 |-- product_subcategoryname: string (nullable = true)

 |-- product_status: string (nullable = true)

 |-- product_departmentname: string (nullable = true)

 |-- product_style: string (nullable = true)

 |-- product_id: string (nullable = true)

 |-- customer_lastname: string (nullable = true)

 |-- customer_familystatus: string (nullable = true)

 |-- customer_customertype: string (nullable = true)

 |-- customer_city: string (nullable = true)

 |-- customer_country: string (nullable = true)

 |-- customer_state: string (nullable = true)

 |-- customer_region: string (nullable = true)

 |-- customer_customergroup: string (nullable = true)

 |-- customer_maritalstatus: string (nullable = true)

 |-- customer_agerange: string (nullable = true)

 |-- customer_zip: string (nullable = true)

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Michael Armbrust
Have you tried this on master?  There were several problems with resolution
of complex queries that were registered as tables in the 1.1.0 release.

On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu 
wrote:

>  Hi all,
>
>  I’m getting a TreeNodeException for unresolved attributes when I do a
> simple select from a schemaRDD generated by a join in Spark 1.1.0. A little
> background first. I am using a HiveContext (against Hive 0.12) to grab two
> tables, join them, and then perform multiple INSERT-SELECT with GROUP BY to
> write back out to a Hive rollup table that has two partitions. This task is
> an effort to simulate the unsupported GROUPING SETS functionality in
> SparkSQL.
>
>  In my first attempt, I got really close using  SchemaRDD.groupBy until I
> realized that SchemaRDD.insertTo API does not support partitioned tables
> yet. This prompted my second attempt to pass in SQL to the HiveContext.sql
> API instead.
>
>  Here’s a rundown of the commands I executed on the spark-shell:
>
>   val hc = new HiveContext(sc)
>
> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true”)
>
> hc.setConf("spark.sql.parquet.compression.codec", "snappy”)
>
>
>  // For implicit conversions to Expression
>
> val sqlContext = new SQLContext(sc)
>
>  import sqlContext._
>
>
>  val segCusts = hc.hql("select …")
>
> val segTxns = hc.hql("select …")
>
>
>  val sc = segCusts.as('sc)
>
>  val st = segTxns.as('st)
>
>
>  // Join the segCusts and segTxns tables
>
> val rup = sc.join(st, Inner,
> Some("sc.segcustomerid".attr==="st.customerid".attr))
>
> rup.registerAsTable("rupbrand")
>
>
>
>  If I do a printSchema on the rup, I get:
>
> root
>
>  |-- segcustomerid: string (nullable = true)
>
>  |-- sales: double (nullable = false)
>
>  |-- tx_count: long (nullable = false)
>
>  |-- storeid: string (nullable = true)
>
>  |-- transdate: long (nullable = true)
>
>  |-- transdate_ts: string (nullable = true)
>
>  |-- transdate_dt: string (nullable = true)
>
>  |-- unitprice: double (nullable = true)
>
>  |-- translineitem: string (nullable = true)
>
>  |-- offerid: string (nullable = true)
>
>  |-- customerid: string (nullable = true)
>
>  |-- customerkey: string (nullable = true)
>
>  |-- sku: string (nullable = true)
>
>  |-- quantity: double (nullable = true)
>
>  |-- returnquantity: double (nullable = true)
>
>  |-- channel: string (nullable = true)
>
>  |-- unitcost: double (nullable = true)
>
>  |-- transid: string (nullable = true)
>
>  |-- productid: string (nullable = true)
>
>  |-- id: string (nullable = true)
>
>  |-- campaign_campaigncost: double (nullable = true)
>
>  |-- campaign_begindate: long (nullable = true)
>
>  |-- campaign_begindate_ts: string (nullable = true)
>
>  |-- campaign_begindate_dt: string (nullable = true)
>
>  |-- campaign_enddate: long (nullable = true)
>
>  |-- campaign_enddate_ts: string (nullable = true)
>
>  |-- campaign_enddate_dt: string (nullable = true)
>
>  |-- campaign_campaigntitle: string (nullable = true)
>
>  |-- campaign_campaignname: string (nullable = true)
>
>  |-- campaign_id: string (nullable = true)
>
>  |-- product_categoryid: string (nullable = true)
>
>  |-- product_company: string (nullable = true)
>
>  |-- product_brandname: string (nullable = true)
>
>  |-- product_vendorid: string (nullable = true)
>
>  |-- product_color: string (nullable = true)
>
>  |-- product_brandid: string (nullable = true)
>
>  |-- product_description: string (nullable = true)
>
>  |-- product_size: string (nullable = true)
>
>  |-- product_subcategoryid: string (nullable = true)
>
>  |-- product_departmentid: string (nullable = true)
>
>  |-- product_productname: string (nullable = true)
>
>  |-- product_categoryname: string (nullable = true)
>
>  |-- product_vendorname: string (nullable = true)
>
>  |-- product_sku: string (nullable = true)
>
>  |-- product_subcategoryname: string (nullable = true)
>
>  |-- product_status: string (nullable = true)
>
>  |-- product_departmentname: string (nullable = true)
>
>  |-- product_style: string (nullable = true)
>
>  |-- product_id: string (nullable = true)
>
>  |-- customer_lastname: string (nullable = true)
>
>  |-- customer_familystatus: string (nullable = true)
>
>  |-- customer_customertype: string (nullable = true)
>
>  |-- customer_city: string (nullable = true)
>
>  |-- customer_country: string (nullable = true)
>
>  |-- customer_state: string (nullable = true)
>
>  |-- customer_region: string (nullable = true)

SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi all,

I’m getting a TreeNodeException for unresolved attributes when I do a simple 
select from a schemaRDD generated by a join in Spark 1.1.0. A little background 
first. I am using a HiveContext (against Hive 0.12) to grab two tables, join 
them, and then perform multiple INSERT-SELECT with GROUP BY to write back out 
to a Hive rollup table that has two partitions. This task is an effort to 
simulate the unsupported GROUPING SETS functionality in SparkSQL.
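
(For reference: GROUPING SETS is commonly approximated by a UNION ALL of separate
GROUP BY queries. The sketch below only illustrates that general pattern with a
couple of the columns from the schema further down; it is not the actual SQL used
here, and hc refers to the HiveContext created in the commands that follow.)

val rollup = hc.sql("""
  SELECT * FROM (
    SELECT storeid, product_brandname, SUM(sales) AS sales
    FROM rupbrand
    GROUP BY storeid, product_brandname
    UNION ALL
    SELECT storeid, CAST(NULL AS STRING) AS product_brandname, SUM(sales) AS sales
    FROM rupbrand
    GROUP BY storeid
  ) unioned
""")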

In my first attempt, I got really close using  SchemaRDD.groupBy until I 
realized that SchemaRDD.insertTo API does not support partitioned tables yet. 
This prompted my second attempt to pass in SQL to the HiveContext.sql API 
instead.

Here’s a rundown of the commands I executed on the spark-shell:


val hc = new HiveContext(sc)

hc.setConf("spark.sql.hive.convertMetastoreParquet", "true”)

hc.setConf("spark.sql.parquet.compression.codec", "snappy”)


// For implicit conversions to Expression

val sqlContext = new SQLContext(sc)

import sqlContext._


val segCusts = hc.hql("select …")

val segTxns = hc.hql("select …")


val sc = segCusts.as('sc)

val st = segTxns.as('st)


// Join the segCusts and segTxns tables

val rup = sc.join(st, Inner, 
Some("sc.segcustomerid".attr==="st.customerid".attr))

rup.registerAsTable("rupbrand")



If I do a printSchema on the rup, I get:

root

 |-- segcustomerid: string (nullable = true)

 |-- sales: double (nullable = false)

 |-- tx_count: long (nullable = false)

 |-- storeid: string (nullable = true)

 |-- transdate: long (nullable = true)

 |-- transdate_ts: string (nullable = true)

 |-- transdate_dt: string (nullable = true)

 |-- unitprice: double (nullable = true)

 |-- translineitem: string (nullable = true)

 |-- offerid: string (nullable = true)

 |-- customerid: string (nullable = true)

 |-- customerkey: string (nullable = true)

 |-- sku: string (nullable = true)

 |-- quantity: double (nullable = true)

 |-- returnquantity: double (nullable = true)

 |-- channel: string (nullable = true)

 |-- unitcost: double (nullable = true)

 |-- transid: string (nullable = true)

 |-- productid: string (nullable = true)

 |-- id: string (nullable = true)

 |-- campaign_campaigncost: double (nullable = true)

 |-- campaign_begindate: long (nullable = true)

 |-- campaign_begindate_ts: string (nullable = true)

 |-- campaign_begindate_dt: string (nullable = true)

 |-- campaign_enddate: long (nullable = true)

 |-- campaign_enddate_ts: string (nullable = true)

 |-- campaign_enddate_dt: string (nullable = true)

 |-- campaign_campaigntitle: string (nullable = true)

 |-- campaign_campaignname: string (nullable = true)

 |-- campaign_id: string (nullable = true)

 |-- product_categoryid: string (nullable = true)

 |-- product_company: string (nullable = true)

 |-- product_brandname: string (nullable = true)

 |-- product_vendorid: string (nullable = true)

 |-- product_color: string (nullable = true)

 |-- product_brandid: string (nullable = true)

 |-- product_description: string (nullable = true)

 |-- product_size: string (nullable = true)

 |-- product_subcategoryid: string (nullable = true)

 |-- product_departmentid: string (nullable = true)

 |-- product_productname: string (nullable = true)

 |-- product_categoryname: string (nullable = true)

 |-- product_vendorname: string (nullable = true)

 |-- product_sku: string (nullable = true)

 |-- product_subcategoryname: string (nullable = true)

 |-- product_status: string (nullable = true)

 |-- product_departmentname: string (nullable = true)

 |-- product_style: string (nullable = true)

 |-- product_id: string (nullable = true)

 |-- customer_lastname: string (nullable = true)

 |-- customer_familystatus: string (nullable = true)

 |-- customer_customertype: string (nullable = true)

 |-- customer_city: string (nullable = true)

 |-- customer_country: string (nullable = true)

 |-- customer_state: string (nullable = true)

 |-- customer_region: string (nullable = true)

 |-- customer_customergroup: string (nullable = true)

 |-- customer_maritalstatus: string (nullable = true)

 |-- customer_agerange: string (nullable = true)

 |-- customer_zip: string (nullable = true)

 |-- customer_age: double (nullable = true)

 |-- customer_address2: string (nullable = true)

 |-- customer_incomerange: string (nullable = true)

 |-- customer_gender: string (nullable = true)

 |-- customer_customerkey: string (nullable = true)

 |-- customer_address1: string (nullable = true)

 |-- customer_email: string (nullable = true)

 |-- customer_education: string (nullable = true)

 |-- customer_birthdate: long (nullable = true)

 |-- customer_birthdate_ts: string (nullable = true)

 |-- customer_birthdate_dt: string (nullable = true)

 |-- customer_id: string (nullable = true)

 |-- customer_firstname: string (nullable = true)

 |-- transnum: long (nullable = true)

 |-- transmonth: string (nullable = true)


Nothi

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-20 Thread Terry Siu
Hi Yin,

Sorry for the delay. I’ll try the code change when I get a chance, but 
Michael’s initial response did solve my problem. In the meantime, I’m hitting 
another issue with SparkSQL, which I will probably post in another message if I 
can’t figure out a workaround.

Thanks,
-Terry

From: Yin Huai <huaiyin@gmail.com>
Date: Thursday, October 16, 2014 at 7:08 AM
To: Terry Siu <terry@smartfocus.com>
Cc: Michael Armbrust <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

Hello Terry,

I guess you hit this bug<https://issues.apache.org/jira/browse/SPARK-3559>. The 
list of needed column ids was messed up. Can you try the master branch or apply 
the code 
change<https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4>
 to your 1.1 and see if the problem is resolved?

Thanks,

Yin

On Wed, Oct 15, 2014 at 12:08 PM, Terry Siu <terry@smartfocus.com> wrote:
Hi Yin,

pqt_rdt_snappy has 76 columns. These two parquet tables were created via Hive 
0.12 from existing Avro data using CREATE TABLE followed by an INSERT 
OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition 
while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I noticed 
that when I populated it with a single INSERT OVERWRITE over all the partitions 
and then executed the Spark code, it would report an illegal index value of 29. 
 However, if I manually did INSERT OVERWRITE for every single partition, I 
would get an illegal index value of 21. I don’t know if this will help in 
debugging, but here’s the DESCRIBE output for pqt_segcust_snappy:


OK

col_name            data_type   comment

customer_id string  from deserializer

age_range   string  from deserializer

gender  string  from deserializer

last_tx_datebigint  from deserializer

last_tx_date_ts string  from deserializer

last_tx_date_dt string  from deserializer

first_tx_date   bigint  from deserializer

first_tx_date_tsstring  from deserializer

first_tx_date_dtstring  from deserializer

second_tx_date  bigint  from deserializer

second_tx_date_ts   string  from deserializer

second_tx_date_dt   string  from deserializer

third_tx_date   bigint  from deserializer

third_tx_date_tsstring  from deserializer

third_tx_date_dtstring  from deserializer

frequency   double  from deserializer

tx_size double  from deserializer

recency double  from deserializer

rfm double  from deserializer

tx_countbigint  from deserializer

sales   double  from deserializer

coll_def_id string  None

seg_def_id  string  None



# Partition Information

# col_name  data_type   comment



coll_def_id string  None

seg_def_id  string  None

Time taken: 0.788 seconds, Fetched: 29 row(s)


As you can see, I have 21 data columns, followed by the 2 partition columns, 
coll_def_id and seg_def_id. Output shows 29 rows, but that looks like it’s just 
counting the rows in the console output. Let me know if you need more 
information.


Thanks

-Terry


From: Yin Huai <huaiyin@gmail.com>
Date: Tuesday, October 14, 2014 at 6:29 PM
To: Terry Siu <terry@smartfocus.com>
Cc: Michael Armbrust <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>

Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,

Yin

On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <terry@smartfocus.com> wrote:
Hi Michael,

That worked for me. At least I’m now further than I was. Thanks for the tip!

-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 13, 2014 at 5:05 PM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

There are some known bugs with the parquet serde and spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause spark sql to use built in parquet support when the serde looks like parquet.

RDD to Multiple Tables SparkSQL

2014-10-20 Thread critikaled
Hi I have a rdd which I want to register as multiple tables based on key


val context = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(context)
import sqlContext.createSchemaRDD

case class KV(key:String,id:String,value:String)
val logsRDD = context.textFile("logs", 10).map{line=>
  val Array(key,id,value) = line split ' '
  (key,id,value)
}.registerTempTable("KVS")

I want to store the above information to multiple tables based on key
without bringing the entire data to master

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Earthson
I'm trying to provide an API for Java users, and I need to accept their
JavaSchemaRDDs and convert them to SchemaRDDs for Scala users.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482p16641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
Hello Terry,

I guess you hit this bug <https://issues.apache.org/jira/browse/SPARK-3559>.
The list of needed column ids was messed up. Can you try the master branch
or apply the code change
<https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4>
to
your 1.1 and see if the problem is resolved?

Thanks,

Yin

On Wed, Oct 15, 2014 at 12:08 PM, Terry Siu 
wrote:

>  Hi Yin,
>
>  pqt_rdt_snappy has 76 columns. These two parquet tables were created via
> Hive 0.12 from existing Avro data using CREATE TABLE followed by an INSERT
> OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition
> while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I
> noticed that when I populated it with a single INSERT OVERWRITE over all
> the partitions and then executed the Spark code, it would report an illegal
> index value of 29.  However, if I manually did INSERT OVERWRITE for every
> single partition, I would get an illegal index value of 21. I don’t know if
> this will help in debugging, but here’s the DESCRIBE output for
> pqt_segcust_snappy:
>
>   OK
>
> col_name            data_type   comment
>
> customer_id string  from deserializer
>
> age_range   string  from deserializer
>
> gender  string  from deserializer
>
> last_tx_datebigint  from deserializer
>
> last_tx_date_ts string  from deserializer
>
> last_tx_date_dt string  from deserializer
>
> first_tx_date   bigint  from deserializer
>
> first_tx_date_tsstring  from deserializer
>
> first_tx_date_dtstring  from deserializer
>
> second_tx_date  bigint  from deserializer
>
> second_tx_date_ts   string  from deserializer
>
> second_tx_date_dt   string  from deserializer
>
> third_tx_date   bigint  from deserializer
>
> third_tx_date_tsstring  from deserializer
>
> third_tx_date_dtstring  from deserializer
>
> frequency   double  from deserializer
>
> tx_size double  from deserializer
>
> recency double  from deserializer
>
> rfm double  from deserializer
>
> tx_countbigint  from deserializer
>
> sales   double  from deserializer
>
> coll_def_id string  None
>
> seg_def_id  string  None
>
>
>
> # Partition Information
>
> # col_name  data_type   comment
>
>
>
> coll_def_id string  None
>
> seg_def_id  string  None
>
> Time taken: 0.788 seconds, Fetched: 29 row(s)
>
>
>  As you can see, I have 21 data columns, followed by the 2 partition
> columns, coll_def_id and seg_def_id. Output shows 29 rows, but that looks
> like it’s just counting the rows in the console output. Let me know if you
> need more information.
>
>
>  Thanks
>
> -Terry
>
>
>   From: Yin Huai 
> Date: Tuesday, October 14, 2014 at 6:29 PM
> To: Terry Siu 
> Cc: Michael Armbrust , "user@spark.apache.org" <
> user@spark.apache.org>
>
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
>   Hello Terry,
>
>  How many columns does pqt_rdt_snappy have?
>
>  Thanks,
>
>  Yin
>
> On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu 
> wrote:
>
>>  Hi Michael,
>>
>>  That worked for me. At least I’m now further than I was. Thanks for the
>> tip!
>>
>>  -Terry
>>
>>   From: Michael Armbrust 
>> Date: Monday, October 13, 2014 at 5:05 PM
>> To: Terry Siu 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>>
>>   There are some known bugs with the parquet serde and spark 1.1.
>>
>>  You can try setting spark.sql.hive.convertMetastoreParquet=true to
>> cause spark sql to use built in parquet support when the serde looks like
>> parquet.
>>
>> On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu 
>> wrote:
>>
>>>  I am currently using Spark 1.1.0 that has been compiled against Hadoop
>>> 2.3. Our cluster is CDH5.1.2 which runs Hive 0.12. I have two external
>>> Hive tables that point to Parquet (compressed with Snappy), which were
>>> converted over from Avro if that matters.

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Cheng Lian
Why do you need to convert a JavaSchemaRDD to SchemaRDD? Are you trying 
to use some API that doesn't exist in JavaSchemaRDD?


On 10/15/14 5:50 PM, Earthson wrote:

I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found
that DataTypeConversions is protected[sql].

Finally I find this solution:



 jrdd.registerTempTable("transform_tmp")
 jrdd.sqlContext.sql("select * from transform_tmp")





Could anyone tell me: is it a good idea for me to use Catalyst as my
DSL's execution engine?

I am trying to build a DSL, And I want to confirm this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-16 Thread Cheng Lian
The warehouse location needs to be specified before the HiveContext 
initialization; you can set it via:

./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/home/spark/hive/warehouse

On 10/15/14 8:55 PM, Hao Ren wrote:


Hi,

The following query in sparkSQL 1.1.0 CLI doesn't work.

SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse
;

create table test as
select v1.*, v2.card_type, v2.card_upgrade_time_black,
v2.card_upgrade_time_gold
from customer v1 left join customer_loyalty v2
on v1.account_id = v2.account_id
limit 5
;

StackTrack =>

org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:file:/user/hive/warehouse/test is not a directory or
unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:559)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog.createTable(HiveMetastoreCatalog.scala:99)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:116)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:111)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$.apply(HiveMetastoreCatalog.scala:111)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:103)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: MetaException(message:file:/user/hive/warehouse/test is not a
directory or unable to create one)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1060)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1107)
at sun.reflect.GeneratedMethodAccessor133.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
at com.sun.proxy.$Proxy15.create_table_with_environment_context(Unknown
Source)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:482)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:471)
at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy16.createTable(Unknown Source)
  

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-15 Thread Terry Siu
Hi Yin,

pqt_rdt_snappy has 76 columns. These two parquet tables were created via Hive 
0.12 from existing Avro data using CREATE TABLE followed by an INSERT 
OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition 
while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I noticed 
that when I populated it with a single INSERT OVERWRITE over all the partitions 
and then executed the Spark code, it would report an illegal index value of 29. 
 However, if I manually did INSERT OVERWRITE for every single partition, I 
would get an illegal index value of 21. I don’t know if this will help in 
debugging, but here’s the DESCRIBE output for pqt_segcust_snappy:


OK

col_name            data_type   comment

customer_id string  from deserializer

age_range   string  from deserializer

gender  string  from deserializer

last_tx_datebigint  from deserializer

last_tx_date_ts string  from deserializer

last_tx_date_dt string  from deserializer

first_tx_date   bigint  from deserializer

first_tx_date_tsstring  from deserializer

first_tx_date_dtstring  from deserializer

second_tx_date  bigint  from deserializer

second_tx_date_ts   string  from deserializer

second_tx_date_dt   string  from deserializer

third_tx_date   bigint  from deserializer

third_tx_date_tsstring  from deserializer

third_tx_date_dtstring  from deserializer

frequency   double  from deserializer

tx_size double  from deserializer

recency double  from deserializer

rfm double  from deserializer

tx_countbigint  from deserializer

sales   double  from deserializer

coll_def_id string  None

seg_def_id  string  None



# Partition Information

# col_name  data_type   comment



coll_def_id string  None

seg_def_id  string  None

Time taken: 0.788 seconds, Fetched: 29 row(s)


As you can see, I have 21 data columns, followed by the 2 partition columns, 
coll_def_id and seg_def_id. Output shows 29 rows, but that looks like it’s just 
counting the rows in the console output. Let me know if you need more 
information.


Thanks

-Terry


From: Yin Huai <huaiyin@gmail.com>
Date: Tuesday, October 14, 2014 at 6:29 PM
To: Terry Siu <terry@smartfocus.com>
Cc: Michael Armbrust <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,

Yin

On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <terry@smartfocus.com> wrote:
Hi Michael,

That worked for me. At least I’m now further than I was. Thanks for the tip!

-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 13, 2014 at 5:05 PM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

There are some known bugs with the parquet serde and spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause spark 
sql to use built in parquet support when the serde looks like parquet.

On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu <terry@smartfocus.com> wrote:
I am currently using Spark 1.1.0 that has been compiled against Hadoop 2.3. Our 
cluster is CDH5.1.2 which runs Hive 0.12. I have two external Hive tables 
that point to Parquet (compressed with Snappy), which were converted over from 
Avro if that matters.

I am trying to perform a join with these two Hive tables, but am encountering 
an exception. In a nutshell, I launch a spark shell, create my HiveContext 
(pointing to the correct metastore on our cluster), and then proceed to do the 
following:

scala> val hc = new HiveContext(sc)

scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 
132537600 and translate <= 134006399")

scala> val segcust = hc.sql("select * from pqt_segcust_snappy where 
coll_def_id='abcd'")

scala> txn.registerAsTable("segTxns")

scala> segcust.registerAsTable("segCusts")

scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")

SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-15 Thread Hao Ren
Hi,

The following query in sparkSQL 1.1.0 CLI doesn't work.

SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse
;

create table test as
select v1.*, v2.card_type, v2.card_upgrade_time_black,
v2.card_upgrade_time_gold
from customer v1 left join customer_loyalty v2
on v1.account_id = v2.account_id
limit 5
;

StackTrack =>

org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:file:/user/hive/warehouse/test is not a directory or
unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:559)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog.createTable(HiveMetastoreCatalog.scala:99)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:116)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:111)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$.apply(HiveMetastoreCatalog.scala:111)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:103)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: MetaException(message:file:/user/hive/warehouse/test is not a
directory or unable to create one)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1060)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1107)
at sun.reflect.GeneratedMethodAccessor133.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
at com.sun.proxy.$Proxy15.create_table_with_environment_context(Unknown
Source)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:482)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:471)
at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy16.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:596)
... 30 more

It seems that CLI doesn't take the hive.metastore.warehouse.dir value when
creating table with "as select ...".
If just create the table, like "create t

[SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-15 Thread Earthson
I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found
that DataTypeConversions is protected[sql].

Finally I find this solution: 



jrdd.registerTempTable("transform_tmp")
jrdd.sqlContext.sql("select * from transform_tmp")
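
A slightly more reusable form of the same trick, as a sketch (the helper name and
default table name are mine): jrdd.sqlContext here is the underlying Scala
SQLContext, so its sql() call already hands back a plain SchemaRDD.

def toSchemaRDD(jrdd: org.apache.spark.sql.api.java.JavaSchemaRDD,
                name: String = "transform_tmp"): org.apache.spark.sql.SchemaRDD = {
  // Round-trip through a temp table so the Scala context produces the SchemaRDD.
  jrdd.registerTempTable(name)
  jrdd.sqlContext.sql("select * from " + name)
}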





Could anyone tell me: is it a good idea for me to use Catalyst as my
DSL's execution engine?

I am trying to build a DSL, And I want to confirm this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,

Yin

On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu 
wrote:

>  Hi Michael,
>
>  That worked for me. At least I’m now further than I was. Thanks for the
> tip!
>
>  -Terry
>
>   From: Michael Armbrust 
> Date: Monday, October 13, 2014 at 5:05 PM
> To: Terry Siu 
> Cc: "user@spark.apache.org" 
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
>   There are some known bugs with the parquet serde and spark 1.1.
>
>  You can try setting spark.sql.hive.convertMetastoreParquet=true to cause
> spark sql to use built in parquet support when the serde looks like parquet.
>
> On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu 
> wrote:
>
>>  I am currently using Spark 1.1.0 that has been compiled against Hadoop
>> 2.3. Our cluster is CDH5.1.2 which runs Hive 0.12. I have two external
>> Hive tables that point to Parquet (compressed with Snappy), which were
>> converted over from Avro if that matters.
>>
>>  I am trying to perform a join with these two Hive tables, but am
>> encountering an exception. In a nutshell, I launch a spark shell, create my
>> HiveContext (pointing to the correct metastore on our cluster), and then
>> proceed to do the following:
>>
>>  scala> val hc = new HiveContext(sc)
>>
>>  scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate
>> >= 132537600 and translate <= 134006399")
>>
>>  scala> val segcust = hc.sql("select * from pqt_segcust_snappy where
>> coll_def_id='abcd'")
>>
>>  scala> txn.registerAsTable("segTxns")
>>
>>  scala> segcust.registerAsTable("segCusts")
>>
>>  scala> val joined = hc.sql("select t.transid, c.customer_id from
>> segTxns t join segCusts c on t.customerid=c.customer_id")
>>
>>  Straight forward enough, but I get the following exception:
>>
>>  14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0
>> (TID 51)
>>
>> java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
>>
>> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>>
>> at java.util.ArrayList.get(ArrayList.java:411)
>>
>> at
>> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
>>
>> at
>> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
>>
>> at
>> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:81)
>>
>> at
>> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:67)
>>
>> at
>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>>
>> at
>> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:197)
>>
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
>>
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>>
>>  The number of columns in my table, pqt_segcust_snappy, has 21 columns
>> and two partitions defined. Does this error look familiar to anyone? Could
>> my usage of SparkSQL with Hive be incorrect or is support with
>> Hive/Parquet/partitioning still buggy at this point in Spark 1.1.0?
>>
>>
>>  Thanks,
>>
>> -Terry
>>
>
>


Re: How to patch sparkSQL on EC2?

2014-10-14 Thread Christos Kozanitis
ahhh never mind… I didn’t notice that a spark-assembly jar file gets produced 
after compiling the whole spark suite… So no more manual editing of the jar 
file of the AMI for now!

Christos 


On Oct 10, 2014, at 12:15 AM, Christos Kozanitis wrote:

> Hi
> 
> I have written a few extensions for sparkSQL (for version 1.1.0) and I am 
> trying to deploy my new jar files (one for catalyst and one for sql/core) on 
> ec2.
> 
> My approach was to create a new 
> spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar that merged the contents of 
> the old one with the contents of my new jar files and I propagated the 
> changes to workers. 
> 
> However when I tried the code snippet below I received the error message that 
> I paste at the end of this email. I was wondering, do you guys have any 
> suggestions on how to fix this?
> 
> thanks
> Christos
> 
> the code is:
> 
> import org.apache.spark.{SparkContext, SparkConf}
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> 
> the error message is:
> 
> error: bad symbolic reference. A signature in package.class refers to term 
> scalalogging
> in package com.typesafe which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> package.class.
> :14: error: bad symbolic reference. A signature in package.class 
> refers to term slf4j
> in value com.typesafe.scalalogging which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> package.class.
>val sqlContext = new SQLContext(sc)
> 



Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Looks like it may be related to
https://issues.apache.org/jira/browse/SPARK-3807.

I will build from branch 1.1 to see if the issue is resolved.

Chen

On Tue, Oct 14, 2014 at 10:33 AM, Chen Song  wrote:

> Sorry for bringing this out again, as I have no clue what could have
> caused this.
>
> I turned on DEBUG logging and did see the jar containing the SerDe class
> was scanned.
>
> More interestingly, I saw the same exception
> (org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes) when running simple select on valid column names and malformed
> column names. This led me to suspect that something breaks syntactically somewhere.
>
> select [valid_column] from table limit 5;
> select [malformed_typo_column] from table limit 5;
>
>
> On Mon, Oct 13, 2014 at 6:04 PM, Chen Song  wrote:
>
>> In Hive, the table was created with custom SerDe, in the following way.
>>
>> row format serde "abc.ProtobufSerDe"
>>
>> with serdeproperties ("serialization.class"=
>> "abc.protobuf.generated.LogA$log_a")
>>
>> When I start spark-sql shell, I always got the following exception, even
>> for a simple query.
>>
>> select user from log_a limit 25;
>>
>> I can desc the table without any problem. When I explain the query, I got
>> the same exception.
>>
>>
>> 14/10/13 22:01:13 INFO impl.AMRMClientImpl: Waiting for application to be
>> successfully unregistered.
>>
>> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
>>
>> Caused by:
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
>> attributes: 'user, tree:
>>
>> Project ['user]
>>
>>  Filter (dh#4 = 2014-10-13 05)
>>
>>   LowerCaseSchema
>>
>>MetastoreRelation test, log_a, None
>>
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
>>
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>
>> at scala.collection.TraversableOnce$class.to
>> (TraversableOnce.scala:273)
>>
>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>
>> at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>>
>> at
>> scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>>
>> at scala.co

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread Michael Armbrust
It's much more efficient to store and compute on numeric types than on string
types.
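
A small illustration of the point (a sketch loosely following the Spark 1.1
programming guide's applySchema example; the table, columns, and data below are
made up). With age declared as IntegerType the predicate compares numbers; the
same values stored as StringType would compare lexicographically, where "100" <
"21".

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)   // sc is the SparkContext from spark-shell

// Declare `age` as a numeric column rather than a string.
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age",  IntegerType, true)))

val rowRDD = sc.parallelize(Seq(Row("alice", 100), Row("bob", 9)))
val people = sqlContext.applySchema(rowRDD, schema)
people.registerTempTable("people")

// Numeric comparison keeps alice (100 > 21); as strings, "100" > "21" is false.
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()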

On Tue, Oct 14, 2014 at 1:25 AM, invkrh  wrote:

> Thank you, Michael.
>
> In Spark SQL DataType, we have a lot of types, for example, ByteType,
> ShortType, StringType, etc.
>
> These types are used to form the table schema. For my purposes, StringType is
> enough, so why do we need the others?
>
> Hao
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-StringType-for-numeric-comparison-tp16295p16361.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Terry Siu
Hi Michael,

That worked for me. At least I’m now further than I was. Thanks for the tip!

-Terry

From: Michael Armbrust <mich...@databricks.com>
Date: Monday, October 13, 2014 at 5:05 PM
To: Terry Siu <terry@smartfocus.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

There are some known bugs with the Parquet SerDe and Spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark
SQL to use its built-in Parquet support when the SerDe looks like Parquet.

On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu 
mailto:terry@smartfocus.com>> wrote:
I am currently using Spark 1.1.0 compiled against Hadoop 2.3. Our
cluster is CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables 
that point to Parquet (compressed with Snappy), which were converted over from 
Avro if that matters.

I am trying to perform a join with these two Hive tables, but am encountering 
an exception. In a nutshell, I launch a spark shell, create my HiveContext 
(pointing to the correct metastore on our cluster), and then proceed to do the 
following:

scala> val hc = new HiveContext(sc)

scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 
132537600 and translate <= 134006399")

scala> val segcust = hc.sql("select * from pqt_segcust_snappy where 
coll_def_id='abcd'")

scala> txn.registerAsTable("segTxns")

scala> segcust.registerAsTable("segCusts")

scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join 
segCusts c on t.customerid=c.customer_id")

Straight forward enough, but I get the following exception:


14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)

java.lang.IndexOutOfBoundsException: Index: 21, Size: 21

at java.util.ArrayList.rangeCheck(ArrayList.java:635)

at java.util.ArrayList.get(ArrayList.java:411)

at 
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)

at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)

at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)

at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)

at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

at org.apache.spark.scheduler.Task.run(Task.scala:54)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


My table, pqt_segcust_snappy, has 21 columns and two partitions defined. Does
this error look familiar to anyone? Could my usage of SparkSQL with Hive be
incorrect, or is Hive/Parquet/partitioning support still buggy at this point in
Spark 1.1.0?


Thanks,

-Terry





Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Sorry for bringing this out again, as I have no clue what could have caused
this.

I turned on DEBUG logging and did see the jar containing the SerDe class
was scanned.

More interestingly, I saw the same exception
(org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes) when running a simple select on both valid and malformed column
names. This led me to suspect that something breaks syntactically somewhere.

select [valid_column] from table limit 5;
select [malformed_typo_column] from table limit 5;


On Mon, Oct 13, 2014 at 6:04 PM, Chen Song  wrote:

> In Hive, the table was created with custom SerDe, in the following way.
>
> row format serde "abc.ProtobufSerDe"
>
> with serdeproperties ("serialization.class"=
> "abc.protobuf.generated.LogA$log_a")
>
> When I start the spark-sql shell, I always get the following exception, even
> for a simple query.
>
> select user from log_a limit 25;
>
> I can desc the table without any problem. When I explain the query, I got
> the same exception.
>
>
> 14/10/13 22:01:13 INFO impl.AMRMClientImpl: Waiting for application to be
> successfully unregistered.
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
>
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
> Unresolved attributes: 'user, tree:
>
> Project ['user]
>
>  Filter (dh#4 = 2014-10-13 05)
>
>   LowerCaseSchema
>
>MetastoreRelation test, log_a, None
>
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
>
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>
> at
> scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>
> at scala.collection.immutable.List.foreach(List.scala:318)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397)
>
> at
> 

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Thank you, Gen.

I will give hiveContext a try. =)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL: select syntax

2014-10-14 Thread Gen
Hi,

I met the same problem before, and according to Matei Zaharia:
"The issue is that you're using SQLContext instead of HiveContext.
SQLContext implements a smaller subset of the SQL language and so you're
getting a SQL parse error because it doesn't support the syntax you have.
Look at how you'd write this in HiveQL, and then try doing that with
HiveContext."

In fact, there are more problems than that. If I remember correctly, Spark SQL
will keep all (15 + 5 = 20) columns in the final table. Therefore, joining two
tables that share column names will cause a duplicate-column error.

Cheers
Gen
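
A hedged sketch of that workaround for the 15-plus-5-column join from Hao's
example, assuming Spark 1.1; the table names a and b, the join key k, and
extra_col are all hypothetical.

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// HiveQL accepts qualified-star projections such as a.*, which SQLContext's parser rejects.
val joined = hc.sql("SELECT a.*, b.extra_col FROM a JOIN b ON a.k = b.k")
joined.registerTempTable("joined")
// Later queries can project everything without retyping the 19 column names.
hc.sql("SELECT * FROM joined LIMIT 5").collect()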


Hao Ren wrote
> Update:
> 
> This syntax is mainly for avoiding retyping column names.
> 
> Let's take the example in my previous post, where *a* is a table of 15
> columns and *b* has 5 columns; after a join, I have a table of
> (15 + 5 - 1 (key in b)) = 19 columns and register the table in sqlContext.
> 
> I don't want to retype all 19 column names when querying
> with select. This feature exists in Hive,
> but in SparkSQL it gives an exception.
> 
> Any ideas ? Thx
> 
> Hao





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Update:

This syntax is mainly for avoiding retyping column names.

Let's take the example in my previous post, where *a* is a table of 15
columns and *b* has 5 columns; after a join, I have a table of (15 + 5 - 1 (key
in b)) = 19 columns and register the table in sqlContext.

I don't want to retype all 19 column names when querying with
select. This feature exists in Hive,
but in SparkSQL it gives an exception.

Any ideas ? Thx

Hao



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16364.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread invkrh
Thank you, Michael.

In Spark SQL DataType, we have a lot of types, for example, ByteType,
ShortType, StringType, etc.

These types are used to form the table schema. For my purposes, StringType is
enough, so why do we need the others?

Hao 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-StringType-for-numeric-comparison-tp16295p16361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Michael Armbrust
There are some known bugs with the Parquet SerDe and Spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause
Spark SQL to use its built-in Parquet support when the SerDe looks like Parquet.
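
A minimal sketch of applying that flag from spark-shell, assuming Spark 1.1
built with Hive support; the table and column names are taken from the report
quoted below, everything else is illustrative.

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Ask Spark SQL to use its built-in Parquet support instead of the Hive Parquet SerDe.
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// Re-run the failing join from the report.
val joined = hc.sql(
  "select t.transid, c.customer_id from pqt_rdt_snappy t join pqt_segcust_snappy c " +
  "on t.customerid = c.customer_id")
joined.collect()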

On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu  wrote:

>  I am currently using Spark 1.1.0 compiled against Hadoop
> 2.3. Our cluster is CDH 5.1.2, which runs Hive 0.12. I have two external
> Hive tables that point to Parquet (compressed with Snappy), which were
> converted over from Avro if that matters.
>
>  I am trying to perform a join with these two Hive tables, but am
> encountering an exception. In a nutshell, I launch a spark shell, create my
> HiveContext (pointing to the correct metastore on our cluster), and then
> proceed to do the following:
>
>  scala> val hc = new HiveContext(sc)
>
>  scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >=
> 132537600 and translate <= 134006399")
>
>  scala> val segcust = hc.sql("select * from pqt_segcust_snappy where
> coll_def_id='abcd'")
>
>  scala> txn.registerAsTable("segTxns")
>
>  scala> segcust.registerAsTable("segCusts")
>
>  scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns
> t join segCusts c on t.customerid=c.customer_id")
>
>  Straight forward enough, but I get the following exception:
>
>   14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0
> (TID 51)
>
> java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
>
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>
> at java.util.ArrayList.get(ArrayList.java:411)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>  My table, pqt_segcust_snappy, has 21 columns and two partitions defined.
> Does this error look familiar to anyone? Could my usage of SparkSQL with
> Hive be incorrect, or is Hive/Parquet/partitioning support still buggy at
> this point in Spark 1.1.0?
>
>
>  Thanks,
>
> -Terry
>


Does SparkSQL work with custom defined SerDe?

2014-10-13 Thread Chen Song
In Hive, the table was created with custom SerDe, in the following way.

row format serde "abc.ProtobufSerDe"

with serdeproperties ("serialization.class"=
"abc.protobuf.generated.LogA$log_a")

When I start the spark-sql shell, I always get the following exception, even
for a simple query.

select user from log_a limit 25;

I can desc the table without any problem. When I explain the query, I got
the same exception.


14/10/13 22:01:13 INFO impl.AMRMClientImpl: Waiting for application to be
successfully unregistered.

Exception in thread "Driver" java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Unresolved attributes: 'user, tree:

Project ['user]

 Filter (dh#4 = 2014-10-13 05)

  LowerCaseSchema

   MetastoreRelation test, log_a, None


at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at scala.collection.TraversableOnce$class.to
(TraversableOnce.scala:273)

at scala.collection.AbstractIterator.to(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at
scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397)

at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)

at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)

at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)

at com.appnexus.data.spark.sql.Test$.main(Test.scala:23)

at com.appnexus.data.spark.sql.Test.main(Test.scala)

... 5
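
A hedged sketch of the workflow described above, driven from spark-shell rather
than the spark-sql shell (assuming Spark 1.1; the SerDe class, table name, and
column come from the thread, while the jar path is hypothetical).

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
hc.sql("ADD JAR /path/to/protobuf-serde.jar")          // make abc.ProtobufSerDe visible to the session
hc.sql("DESCRIBE log_a").collect().foreach(println)    // works, per the thread
hc.sql("SELECT user FROM log_a LIMIT 25").collect()    // reportedly fails with the Unresolved attributes error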

SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Terry Siu
I am currently using Spark 1.1.0 compiled against Hadoop 2.3. Our
cluster is CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables 
that point to Parquet (compressed with Snappy), which were converted over from 
Avro if that matters.

I am trying to perform a join with these two Hive tables, but am encountering 
an exception. In a nutshell, I launch a spark shell, create my HiveContext 
(pointing to the correct metastore on our cluster), and then proceed to do the 
following:

scala> val hc = new HiveContext(sc)

scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 
132537600 and translate <= 134006399")

scala> val segcust = hc.sql("select * from pqt_segcust_snappy where 
coll_def_id='abcd'")

scala> txn.registerAsTable("segTxns")

scala> segcust.registerAsTable("segCusts")

scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join 
segCusts c on t.customerid=c.customer_id")

Straight forward enough, but I get the following exception:


14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)

java.lang.IndexOutOfBoundsException: Index: 21, Size: 21

at java.util.ArrayList.rangeCheck(ArrayList.java:635)

at java.util.ArrayList.get(ArrayList.java:411)

at 
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)

at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)

at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)

at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)

at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)

at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

at org.apache.spark.scheduler.Task.run(Task.scala:54)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


My table, pqt_segcust_snappy, has 21 columns and two partitions defined. Does
this error look familiar to anyone? Could my usage of SparkSQL with Hive be
incorrect, or is Hive/Parquet/partitioning support still buggy at this point in
Spark 1.1.0?


Thanks,

-Terry



