SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1

2015-03-10 Thread Pei-Lun Lee
Hi,

I found that if I try to read a parquet file generated by Spark 1.1.1 using
1.3.0-rc3 with default settings, I get this error:

com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'StructType': was expecting ('true', 'false' or 'null')
 at [Source: StructType(List(StructField(a,IntegerType,false))); line: 1, column: 11]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2300)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1459)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3105)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
at org.apache.spark.sql.types.DataType$.fromJson(dataTypes.scala:41)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)



This is how I saved the parquet file with 1.1.1:

sql("select 1 as a").saveAsParquetFile("/tmp/foo")



and this is the metadata of the 1.1.1 parquet file:

creator: parquet-mr version 1.4.3
extra:   org.apache.spark.sql.parquet.row.metadata =
StructType(List(StructField(a,IntegerType,false)))



By comparison, this is the 1.3.0 metadata:

creator: parquet-mr version 1.6.0rc3
extra:   org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"a","type":"integer","nullable":t
[more]...



It looks like ParquetRelation2 is now used to load parquet files by default,
and it only recognizes the JSON-format schema, whereas the 1.1.1 schema was in
the case-class string format.

Setting spark.sql.parquet.useDataSourceApi to false fixes it, but I don't know
the differences between the two code paths.
Is this considered a bug? We have a lot of parquet files from 1.1.1; should we
disable the data source API in order to read them if we want to upgrade to
1.3?
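
Here is a minimal sketch of the workaround as I understand it, assuming a
1.3.0 spark-shell where sqlContext is already defined (whether the setting is
picked up at read time or has to be set before the context is created is
something I have not verified):

// Fall back to the old Parquet code path so the legacy case-class-string
// schema written by 1.1.1 can still be read.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

val df = sqlContext.parquetFile("/tmp/foo")  // the file written by 1.1.1
df.printSchema()
df.show()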

Thanks,
--
Pei-Lun


[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using Spark 1.3.0 RC3 build with Hive support.

 

In the Spark shell, I want to reuse the HiveContext instance with different
warehouse locations. Below are the steps for my test (assume I have already
loaded a file into table "src").

 

==

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive
support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

==

After these steps, both tables are stored under "/test/w". I expected
"table2" to be stored under the "/test/w2" folder.

 

Another question: if I set "hive.metastore.warehouse.dir" to an HDFS
folder, I cannot use saveAsTable(). Is this by design? The exception stack
trace is below:

==

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0

15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74

java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
        at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
        at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC.<init>(<console>:35)
        at $iwC.<init>(<console>:37)
        at <init>(<console>:39)

 

Thank you very much!

 



RE: Using CUDA within Spark / boosting linear algebra

2015-03-10 Thread Ulanov, Alexander
I can run the benchmark on another machine with an nVidia Titan GPU and an
Intel Xeon E5-2650 v2, although it runs Windows and I have to run the Linux
tests in VirtualBox.

It would also be interesting to add results for netlib+nvblas; however, I am
not sure I understand in detail how to build this and would appreciate any
help from you ☺

From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on 
various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
Hi everyone, I've updated the benchmark as Xiangrui suggested: I added a
comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
support for Double in the current source code) and ran the test with BIDMat
and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander
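
As a side note for anyone reproducing these numbers: a quick way to confirm
which BLAS backend netlib-java actually loaded on a given machine is the
sketch below (the class names are netlib-java's; which fallback you see may
vary by version):

// Print the BLAS implementation netlib-java resolved at runtime.
// F2jBLAS means the pure-Java fallback; NativeSystemBLAS or NativeRefBLAS
// mean a native library (e.g. OpenBLAS or MKL) was picked up.
import com.github.fommil.netlib.BLAS
println("BLAS backend: " + BLAS.getInstance().getClass.getName)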

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <men...@gmail.com> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>> Better documentation for linking would be very helpful!  Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks,
>>> Alex. The big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
>>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide substantial (but less than an order of
>>> magnitude)
>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>>> netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>>> worse than a well-tuned CPU implementation, particularly for larger 
>>> matrices.
>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>>> basically agrees with the authors own benchmarks (
>>> https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs
>>> may not be practical - although we could consider having a good GPU
>>> backend available as an option. However, *ALL* users of MLlib could
>>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> BLAS implementation. Perhaps we should consider updating the mllib
>>> guide with a more complete section for enabling high performance
>>> binaries on OSX and Linux? Or better, figure out a way for the
>>> system to fetch these automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>>
 Just to summarize this thread, I was finally able to make all
 performance comparisons that we discussed. It turns out that:
 BIDMat-cublas>>BIDMat
 MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo=
 =netlib-cublas>netlib-blas>f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
 378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform
 copying to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Thanks, Evan! It seems that ticket was marked as duplicate though
 the original one discusses slightly different topic. I was able to
 link netlib with MKL fr

Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Nitay Joffe
In Spark 1.2 I used to be able to do this:

scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true)))

That is, the name of a column can be a keyword like "int". This is no
longer the case in 1.3:

data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
failure: ``>'' expected but `int' found

struct<int:bigint>
       ^
at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
at
org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
at
org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)

Note HiveTypeHelper is simply an object I load in to expose
HiveMetastoreTypes since it was made private. See
https://gist.github.com/nitay/460b41ed5fd7608507f5


This is actually a pretty big problem for us as we have a bunch of legacy
tables with column names like "timestamp". They work fine in 1.2, but now
everything throws in 1.3.

Any thoughts?

Thanks,
- Nitay
Founder & CTO


Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Michael Armbrust
Thanks for reporting.  This was a result of a change to our DDL parser that
resulted in type names becoming reserved words.  I've filed a JIRA and will
investigate whether this is something we can fix.
https://issues.apache.org/jira/browse/SPARK-6250

On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe  wrote:

> In Spark 1.2 I used to be able to do this:
>
> scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
> res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true)))
>
> That is, the name of a column can be a keyword like "int". This is no
> longer the case in 1.3:
>
> data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
> org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
> failure: ``>'' expected but `int' found
>
> struct<int:bigint>
>        ^
> at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
> at
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
> at
> org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)
>
> Note HiveTypeHelper is simply an object I load in to expose
> HiveMetastoreTypes since it was made private. See
> https://gist.github.com/nitay/460b41ed5fd7608507f5
> 
>
> This is actually a pretty big problem for us as we have a bunch of legacy
> tables with column names like "timestamp". They work fine in 1.2, but now
> everything throws in 1.3.
>
> Any thoughts?
>
> Thanks,
> - Nitay
> Founder & CTO
>
>


Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Yin Huai
Hi Nitay,

Can you try using backticks to quote the column name? Like
org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(
"struct<`int`:bigint>")?

Thanks,

Yin

On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust 
wrote:

> Thanks for reporting.  This was a result of a change to our DDL parser
> that resulted in type names becoming reserved words.  I've filed a JIRA and
> will investigate whether this is something we can fix.
> https://issues.apache.org/jira/browse/SPARK-6250
>
> On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe  wrote:
>
>> In Spark 1.2 I used to be able to do this:
>>
>> scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
>> res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true)))
>>
>> That is, the name of a column can be a keyword like "int". This is no
>> longer the case in 1.3:
>>
>> data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
>> org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
>> failure: ``>'' expected but `int' found
>>
>> struct<int:bigint>
>>        ^
>> at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
>> at
>> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
>> at
>> org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)
>>
>> Note HiveTypeHelper is simply an object I load in to expose
>> HiveMetastoreTypes since it was made private. See
>> https://gist.github.com/nitay/460b41ed5fd7608507f5
>> 
>>
>> This is actually a pretty big problem for us as we have a bunch of legacy
>> tables with column names like "timestamp". They work fine in 1.2, but now
>> everything throws in 1.3.
>>
>> Any thoughts?
>>
>> Thanks,
>> - Nitay
>> Founder & CTO
>>
>>
>


Spark tests hang on local machine due to "testGuavaOptional" in JavaAPISuite

2015-03-10 Thread Ganelin, Ilya
Hi all – building Spark on my local machine with build/mvn clean package test 
runs until it hits the JavaAPISuite where it hangs indefinitely. Through some 
experimentation, I’ve narrowed it down to the following test:


/**
 * Test for SPARK-3647. This test needs to use the maven-built assembly to trigger the issue,
 * since that's the only artifact where Guava classes have been relocated.
 */
@Test
public void testGuavaOptional() {
  // Stop the context created in setUp() and start a local-cluster one, to force usage of the
  // assembly.
  sc.stop();
  JavaSparkContext localCluster = new JavaSparkContext("local-cluster[1,1,512]", "JavaAPISuite");
  try {
    JavaRDD<Integer> rdd1 = localCluster.parallelize(Arrays.asList(1, 2, null), 3);
    JavaRDD<Optional<Integer>> rdd2 = rdd1.map(
      new Function<Integer, Optional<Integer>>() {
        @Override
        public Optional<Integer> call(Integer i) {
          return Optional.fromNullable(i);
        }
      });
    rdd2.collect();
  } finally {
    localCluster.stop();
  }
}


If I remove this test, things work smoothly. Has anyone else seen this? Thanks.


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Spark tests hang on local machine due to "testGuavaOptional" in JavaAPISuite

2015-03-10 Thread Sean Owen
Yes and I remember it was caused by ... well something related to the
Guava shading and the fact that you're running a mini cluster and then
talking to it. I can't remember what exactly resolved it but try a
clean build. Somehow I think it had to do with multiple assembly files
or something like that.
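
A quick way to check the "multiple assemblies" theory before doing a full
clean build is something like the sketch below (the path assumes a default
maven build tree; adjust as needed):

// List spark-assembly jars under assembly/target; more than one hadoop
// variant lying around can confuse local-cluster tests like this one.
import java.io.File

def findAssemblies(dir: File): Seq[File] =
  Option(dir.listFiles).toSeq.flatten.flatMap { f =>
    if (f.isDirectory) findAssemblies(f)
    else if (f.getName.startsWith("spark-assembly") && f.getName.endsWith(".jar")) Seq(f)
    else Nil
  }

findAssemblies(new File("assembly/target")).foreach(println)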

On Wed, Mar 11, 2015 at 12:09 AM, Ganelin, Ilya
 wrote:
> Hi all – building Spark on my local machine with build/mvn clean package test 
> runs until it hits the JavaAPISuite where it hangs indefinitely. Through some 
> experimentation, I’ve narrowed it down to the following test:
>
>
> /**
>  * Test for SPARK-3647. This test needs to use the maven-built assembly to 
> trigger the issue,
>  * since that's the only artifact where Guava classes have been relocated.
>  */
> @Test
> public void testGuavaOptional() {
>   // Stop the context created in setUp() and start a local-cluster one, to 
> force usage of the
>   // assembly.
>   sc.stop();
>   JavaSparkContext localCluster = new 
> JavaSparkContext("local-cluster[1,1,512]", "JavaAPISuite");
>   try {
> JavaRDD rdd1 = localCluster.parallelize(Arrays.asList(1, 2, 
> null), 3);
> JavaRDD> rdd2 = rdd1.map(
>   new Function>() {
> @Override
> public Optional call(Integer i) {
>   return Optional.fromNullable(i);
> }
>   });
> rdd2.collect();
>   } finally {
> localCluster.stop();
>   }
> }
>
>
> If I remove this test, things work smoothly. Has anyone else seen this? 
> Thanks.
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Cheng, Hao
I am not so sure Hive supports changing the metastore after it has been
initialized; I guess not. Spark SQL relies entirely on the Hive Metastore in
HiveContext, which is probably why it doesn't work as expected for Q1.

BTW, in most cases people configure the metastore settings in hive-site.xml
and never change them afterwards; is there any reason you want to change them
at runtime?

For Q2, probably something is wrong in the configuration; it seems HDFS is
running in pseudo/single-node mode. Can you double-check that? Or can you run
DDL (like creating a table) from the spark shell with HiveContext?
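
A quick way to check that from the shell is the sketch below; these are
standard Hadoop APIs, and the expected value is whatever your cluster's
core-site.xml declares:

// Print the default filesystem Spark's Hadoop configuration resolves to.
// "expected: file:///" in the exception usually means this comes back as the
// local filesystem instead of hdfs://server:8020.
import org.apache.hadoop.fs.FileSystem
val hadoopConf = sc.hadoopConfiguration
println(hadoopConf.get("fs.defaultFS"))    // may be null if only the old fs.default.name key is set
println(FileSystem.get(hadoopConf).getUri) // the filesystem actually resolved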

From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?


I'm using Spark 1.3.0 RC3 build with Hive support.



In Spark Shell, I want to reuse the HiveContext instance to different warehouse 
locations. Below are the steps for my test (Assume I have loaded a file into 
table "src").



==

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

==

After these steps, the tables are stored in "/test/w" only. I expect "table2" 
to be stored in "/test/w2" folder.



Another question is: if I set "hive.metastore.warehouse.dir" to a HDFS folder, 
I cannot use saveAsTable()? Is this by design? Exception stack trace is below:

==

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block 
broadcast_0_piece0

15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at 
TableReader.scala:74

java.lang.IllegalArgumentException: Wrong FS: 
hdfs://server:8020/space/warehouse/table2, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)

at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)

at 
org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)

at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)

at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at scala.collection.immutable.List.foreach(List.scala:318)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.AbstractTraversable.map(Traversable.scala:105)

at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)

at 
org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:370)

at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)

at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)

at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)

at 
org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)

at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)

at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)

at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)

at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)

at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:20)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:27)

at $iwC$$iwC$$iwC$$iwC$$iwC.(:29)

at $iwC$$iwC$$iwC$$iwC.(:31)

at $iwC$$iwC$$iwC.(:33)

at $iwC$$iwC.(:35)

at $iwC.(:37)

at (:39)



Thank you very much!




GitHub Syncing Down

2015-03-10 Thread Michael Armbrust
FYI: https://issues.apache.org/jira/browse/INFRA-9259


Re: SparkSpark-perf terasort WIP branch

2015-03-10 Thread Reynold Xin
Hi Ewan,

Sorry it took a while for us to reply. I don't know spark-perf that well,
but I think this would be problematic if it only works with a specific
version of Hadoop. Maybe we can take a different approach -- just have a
bunch of tasks use the HDFS client API to read data, and not rely on
input formats?
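
A rough sketch of that idea is below; the paths, the partitioning, and the
100-byte terasort record size are assumptions here, not spark-perf code:

// Read terasort input through the HDFS client API from inside tasks, with no
// dependency on a particular Hadoop mapreduce InputFormat.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val inputFiles = sc.parallelize(Seq("/tmp/terasort_in/part-00000",
                                    "/tmp/terasort_in/part-00001"), 2)
val recordCounts = inputFiles.map { name =>
  val fs = FileSystem.get(new Configuration())  // executors pick up the cluster config
  (name, fs.getFileStatus(new Path(name)).getLen / 100)  // terasort records are 100 bytes
}
recordCounts.collect().foreach(println)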


On Fri, Mar 6, 2015 at 1:41 AM, Ewan Higgs  wrote:

> Hi all,
> I never heard back from anyone on this, and have received private emails
> saying that people would like to add terasort to their spark-perf installs
> so it becomes part of their cluster validation checks.
>
> Yours,
> Ewan
>
>
>  Forwarded Message 
> Subject:SparkSpark-perf terasort WIP branch
> Date:   Wed, 14 Jan 2015 14:33:45 +0100
> From:   Ewan Higgs 
> To: dev@spark.apache.org 
>
>
>
> Hi all,
> I'm trying to build the Spark-perf WIP code but there are some errors to
> do with the Hadoop APIs. I presume this is because a Hadoop version is set
> somewhere and the build is referring to that, but I can't seem to find where.
>
> The errors are as follows:
>
> [info] Compiling 15 Scala sources and 2 Java sources to
> /home/ehiggs/src/spark-perf/spark-tests/target/scala-2.10/classes...
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraInputFormat.scala:40:
> object task is not a member of package org.apache.hadoop.mapreduce
> [error] import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraInputFormat.scala:132:
> not found: type TaskAttemptContextImpl
> [error] val context = new TaskAttemptContextImpl(
> [error]   ^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraScheduler.scala:37:
> object TTConfig is not a member of package
> org.apache.hadoop.mapreduce.server.tasktracker
> [error] import org.apache.hadoop.mapreduce.server.tasktracker.TTConfig
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraScheduler.scala:91:
> not found: value TTConfig
> [error]   var slotsPerHost : Int = conf.getInt(TTConfig.TT_MAP_SLOTS, 4)
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraSortAll.scala:7:
> value run is not a member of org.apache.spark.examples.terasort.TeraGen
> [error] tg.run(Array[String]("10M", "/tmp/terasort_in"))
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraSortAll.scala:9:
> value run is not a member of org.apache.spark.examples.terasort.TeraSort
> [error] ts.run(Array[String]("/tmp/terasort_in", "/tmp/terasort_out"))
> [error]^
> [error] 6 errors found
> [error] (compile:compile) Compilation failed
> [error] Total time: 13 s, completed 05-Jan-2015 12:21:47
>
> I can build the same code if it's in the Spark tree using the following
> command:
> mvn -Dhadoop.version=2.5.0 -DskipTests=true install
>
> Is there a way I can convince spark-perf to build this code with the
> appropriate Hadoop library version? I tried to apply the following to
> spark-tests/project/SparkTestsBuild.scala but it didn't seem to work as
> I expected:
>
> $ git diff project/SparkTestsBuild.scala
> diff --git a/spark-tests/project/SparkTestsBuild.scala
> b/spark-tests/project/SparkTestsBuild.scala
> index 4116326..4ed5f0c 100644
> --- a/spark-tests/project/SparkTestsBuild.scala
> +++ b/spark-tests/project/SparkTestsBuild.scala
> @@ -16,7 +16,9 @@ object SparkTestsBuild extends Build {
>   "org.scalatest" %% "scalatest" % "2.2.1" % "test",
>   "com.google.guava" % "guava" % "14.0.1",
>   "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
> -"org.json4s" %% "json4s-native" % "3.2.9"
> +"org.json4s" %% "json4s-native" % "3.2.9",
> +"org.apache.hadoop" % "hadoop-common" % "2.5.0",
> +"org.apache.hadoop" % "hadoop-mapreduce" % "2.5.0"
> ),
> test in assembly := {},
> outputPath in assembly :=
> file("target/spark-perf-tests-assembly.jar"),
> @@ -36,4 +38,4 @@ object SparkTestsBuild extends Build {
>   case _ => MergeStrategy.first
> }
>   ))
> -}
> \ No newline at end of file
> +}
>
>
> Yours,
> Ewan
>
>
>
>
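
Regarding the dependency tweak in the diff above: one hedged guess is that
"hadoop-mapreduce" does not resolve to a jar artifact (it looks like an
aggregator module), so the extra dependencies may never actually land on the
compile classpath. Something along these lines in SparkTestsBuild.scala might
work better, though I have not verified it against spark-perf:

// Pull the Hadoop 2.x mapreduce client classes in via hadoop-client so that
// TaskAttemptContextImpl and TTConfig can resolve at compile time.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.5.0"
)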


[RESULT] [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-10 Thread Patrick Wendell
This vote passes with 13 +1 votes (6 binding) and no 0 or -1 votes:

+1 (13):
Patrick Wendell*
Marcelo Vanzin
Krishna Sankar
Sean Owen*
Matei Zaharia*
Sandy Ryza
Tom Graves*
Sean McNamara*
Denny Lee
Kostas Sakellis
Joseph Bradley*
Corey Nolet
GuoQiang Li

0:
-1:

I will finalize the release notes and packaging and will post the
release in the next two days.

- Patrick

On Mon, Mar 9, 2015 at 11:51 PM, GuoQiang Li  wrote:
> I'm sorry, this is my mistake. :)
>
>
> -- Original Message --
> From: "Patrick Wendell";
> Sent: Tuesday, March 10, 2015, 2:20 PM
> To: "GuoQiang Li";
> Subject: Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
>
> Thanks! But please e-mail the dev list and not just me personally :)
>
> On Mon, Mar 9, 2015 at 11:08 PM, GuoQiang Li  wrote:
>> +1 (non-binding)
>>
>> Test on Mac OS X 10.10.2 and CentOS 6.5
>>
>>
>> -- Original --
>> From:  "Patrick Wendell";;
>> Date:  Fri, Mar 6, 2015 10:52 AM
>> To:  "dev@spark.apache.org";
>> Subject:  [VOTE] Release Apache Spark 1.3.0 (RC3)
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1078
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Monday, March 09, at 02:52 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC2 ==
>> This release includes the following bug fixes:
>>
>> https://issues.apache.org/jira/browse/SPARK-6144
>> https://issues.apache.org/jira/browse/SPARK-6171
>> https://issues.apache.org/jira/browse/SPARK-5143
>> https://issues.apache.org/jira/browse/SPARK-6182
>> https://issues.apache.org/jira/browse/SPARK-6175
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1
>> vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
Hao, thanks for the response.

 

For Q1, in my case I have a tool on the Spark shell which serves multiple
users, and they may use different Hive installations. I took a look at the
code of HiveContext; it looks like I cannot do that today because the
"catalog" field cannot be changed after initialization.

 

  /* A catalyst metadata catalog that points to the Hive Metastore. */
  @transient
  override protected[sql] lazy val catalog = new HiveMetastoreCatalog(this) with OverrideCatalog
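
Given that, the best workaround I can think of is a sketch like the one below:
one HiveContext per warehouse, configured before anything forces the lazy
catalog to initialize. Both points are assumptions I have not verified:
whether Spark tolerates several HiveContexts in one shell, and whether setting
hive.metastore.warehouse.dir this early is honoured.

// One HiveContext per warehouse location, configured before first use.
import org.apache.spark.sql.hive.HiveContext

def hiveContextFor(warehouseDir: String) = {
  val hc = new HiveContext(sc)
  hc.setConf("hive.metastore.warehouse.dir", warehouseDir)
  hc
}

val hc1 = hiveContextFor("/test/w")
hc1.sql("SELECT * FROM src").saveAsTable("table1")

val hc2 = hiveContextFor("/test/w2")
hc2.sql("SELECT * FROM src").saveAsTable("table2")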

 

For Q2, I checked HDFS and it is running as a cluster. I can run DDL from
the spark shell with HiveContext as well. To reproduce the exception, I just
run the script below; it happens in the last step.

 

15/03/11 14:24:48 INFO SparkILoop: Created sql context (with Hive
support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET
hive.metastore.warehouse.dir=hdfs://server:8020/space/warehouse")

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src(key INT, value
STRING)")

scala> sqlContext.sql("LOAD DATA LOCAL INPATH
'examples/src/main/resources/kv1.txt' INTO TABLE src")

scala> var output = sqlContext.sql("SELECT key,value FROM src")

scala> output.saveAsTable("outputtable")

 



From: Cheng, Hao [mailto:hao.ch...@intel.com] 
Sent: Wednesday, March 11, 2015 8:25 AM
To: Haopu Wang; user; dev@spark.apache.org
Subject: RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

 

I am not so sure if Hive supports change the metastore after
initialized, I guess not. Spark SQL totally rely on Hive Metastore in
HiveContext, probably that's why it doesn't work as expected for Q1.

 

BTW, in most of cases, people configure the metastore settings in
hive-site.xml, and will not change that since then, is there any reason
that you want to change that in runtime?

 

For Q2, probably something wrong in configuration, seems the HDFS run
into the pseudo/single node mode, can you double check that? Or can you
run the DDL (like create a table) from the spark shell with HiveContext?


 

From: Haopu Wang [mailto:hw...@qilinsoft.com] 
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?

 

I'm using Spark 1.3.0 RC3 build with Hive support.

 

In Spark Shell, I want to reuse the HiveContext instance to different
warehouse locations. Below are the steps for my test (Assume I have
loaded a file into table "src").

 

==

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive
support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

==

After these steps, the tables are stored in "/test/w" only. I expect
"table2" to be stored in "/test/w2" folder.

 

Another question is: if I set "hive.metastore.warehouse.dir" to a HDFS
folder, I cannot use saveAsTable()? Is this by design? Exception stack
trace is below:

==

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block
broadcast_0_piece0

15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast
at TableReader.scala:74

java.lang.IllegalArgumentException: Wrong FS:
hdfs://server:8020/space/warehouse/table2, expected: file:///
 

at
org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)

at
org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)

at
org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.jav
a:118)

at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.a
pply(newParquet.scala:252)

at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.a
pply(newParquet.scala:251)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.sc
ala:244)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.sc
ala:244)

at scala.collection.immutable.List.foreach(List.scala:318)

at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at
scala.collection.AbstractTraversable.map(Traversable.scala:105)

at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newP
arquet.scala:251)

at
org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:37
0)

at
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.sca
la:96)

at
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.sca
la:125)

at
org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)

at
org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.ru
n(commands.scala:217)

at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompu
te(commands.scala:55)

at
org.apach