[SparkSQL] Reuse HiveContext to different Hive warehouse?
I'm using a Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance with different warehouse locations. Below are the steps for my test (assume I have already loaded a file into table src).

==
15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")
scala> sqlContext.sql("SELECT * FROM src").saveAsTable("table1")
scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")
scala> sqlContext.sql("SELECT * FROM src").saveAsTable("table2")
==

After these steps, both tables are stored in /test/w only. I expect table2 to be stored in the /test/w2 folder.

Another question is: if I set hive.metastore.warehouse.dir to an HDFS folder, I cannot use saveAsTable()? Is this by design? The exception stack trace is below:

==
15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74
java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
        at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
        at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC.<init>(<console>:35)
        at $iwC.<init>(<console>:37)
        at <init>(<console>:39)
==

Thank you very much!
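Not an answer from the thread, but one workaround sketch while hive.metastore.warehouse.dir cannot be changed on a live HiveContext: write each result to an explicit path instead of relying on the warehouse location. This assumes a spark-shell session like the one above; saveAsParquetFile is the 1.3-era DataFrame API, and the paths are illustrative:

```scala
// Hedged workaround sketch: bypass the warehouse directory entirely by
// writing to an explicit path, so the storage location no longer depends
// on hive.metastore.warehouse.dir. Paths below are illustrative only.
val df = sqlContext.sql("SELECT * FROM src")
df.saveAsParquetFile("hdfs://server:8020/test/w2/table2")  // explicit location
```

The data then lands where you point it, at the cost of not being registered as a metastore table.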
GitHub Syncing Down
FYI: https://issues.apache.org/jira/browse/INFRA-9259
RE: Using CUDA within Spark / boosting linear algebra
I can run the benchmark on another machine with an nVidia Titan GPU and an Intel Xeon E5-2650 v2, although it runs Windows and I have to run the Linux tests in VirtualBox. It would also be interesting to add results for netlib+nvblas; however, I am not sure I understand in detail how to build this and will appreciate any help from you ☺

From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...

On 9 Mar 2015 21:08, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng <men...@gmail.com> writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas.
What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019

On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote: Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).

2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib; it basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).

I think that most of our users are in a situation where using GPUs may not be practical, although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section on enabling high-performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote: Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed.
It turns out that: BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas.

Below is the link to the spreadsheet with full results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, February 10, 2015 2:12 PM
To: Evan R. Sparks
Cc: Joseph Bradley; dev@spark.apache.org
Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.

|A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat| Breeze+Netlib-OpenBlas(native
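For readers who want to reproduce this kind of number locally: the metric behind the spreadsheet is essentially throughput for a dense n x n matrix multiply, i.e. 2·n³ floating-point operations divided by wall time. A minimal, library-free Scala sketch of such a timing harness follows. This is an illustration written for this digest, not code from the thread, and the naive triple loop only bounds the comparison from below; the actual benchmarks go through netlib-java / BIDMat so an optimized BLAS (or cuBLAS) does the work:

```scala
// Naive DGEMM timing sketch: C = A * B for n x n matrices, reporting GFLOPS.
// Matrices are stored row-major in flat arrays; the loop order (i, k, j)
// keeps the inner loop sequential in memory.
object GemmBench {
  def main(args: Array[String]): Unit = {
    val n = 256
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    val c = new Array[Double](n * n)
    val t0 = System.nanoTime()
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) {
          c(i * n + j) += aik * b(k * n + j)
          j += 1
        }
        k += 1
      }
      i += 1
    }
    val seconds = (System.nanoTime() - t0) / 1e9
    val gflops = 2.0 * n * n * n / seconds / 1e9
    println(f"n=$n%d time=$seconds%.3fs rate=$gflops%.2f GFLOPS")
  }
}
```

Swapping the inner loop for a BLAS dgemm call is exactly the difference the f2jblas-vs-MKL rows in the spreadsheet are measuring.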
Re: Spark 1.3 SQL Type Parser Changes?
Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filed a JIRA and will investigate whether this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250

On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe <ni...@actioniq.co> wrote:

In Spark 1.2 I used to be able to do this:

scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true)))

That is, the name of a column can be a keyword like int. This is no longer the case in 1.3:

data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``>'' expected but `int' found

struct<int:bigint>
       ^
        at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
        at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
        at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)

Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes, since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5

This is actually a pretty big problem for us, as we have a bunch of legacy tables with column names like "timestamp". They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1
Hi, I found that if I try to read a parquet file generated by Spark 1.1.1 using 1.3.0-rc3 with default settings, I get this error:

com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'StructType': was expecting ('true', 'false' or 'null')
 at [Source: StructType(List(StructField(a,IntegerType,false))); line: 1, column: 11]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508)
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2300)
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1459)
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
        at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3105)
        at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)
        at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
        at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
        at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
        at org.apache.spark.sql.types.DataType$.fromJson(dataTypes.scala:41)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)

This is how I save the parquet file with 1.1.1:

sql("select 1 as a").saveAsParquetFile("/tmp/foo")

and this is the metadata of the 1.1.1 parquet file:

creator: parquet-mr version 1.4.3
extra:   org.apache.spark.sql.parquet.row.metadata = StructType(List(StructField(a,IntegerType,false)))

By comparison, this is the 1.3.0 metadata:

creator: parquet-mr version 1.6.0rc3
extra:   org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":t [more]...
It looks like ParquetRelation2 is now used to load parquet files by default, and it only recognizes the JSON schema format, but the 1.1.1 schema was in the case-class string format. Setting spark.sql.parquet.useDataSourceApi to false will fix it, but I don't know the differences. Is this considered a bug? We have a lot of parquet files from 1.1.1; should we disable the data source API in order to read them if we want to upgrade to 1.3? Thanks, -- Pei-Lun
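Pei-Lun's workaround can be applied per session; a minimal spark-shell sketch, assuming a 1.3.0-rc3 build and that /tmp/foo was written by 1.1.1 as above:

```scala
// Fall back to the pre-data-source-API Parquet code path so the legacy
// case-class-string schema metadata written by Spark 1.1.1 is understood.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
val df = sqlContext.parquetFile("/tmp/foo")
```

The trade-off is losing whatever the new ParquetRelation2 path adds (e.g. schema merging across files), so it is best treated as a migration stopgap rather than a permanent setting.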
Re: Spark 1.3 SQL Type Parser Changes?
Hi Nitay, Can you try using backticks to quote the column name? Like org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")? Thanks, Yin

On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust <mich...@databricks.com> wrote: Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filed a JIRA and will investigate whether this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250
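For completeness, Yin's suggestion in runnable form (spark-shell with a Hive-enabled 1.3 build; the unquoted call is the same one from Nitay's report):

```scala
// Backticks escape the now-reserved word `int` in the 1.3 DDL parser,
org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")
// whereas "struct<int:bigint>" throws a DDLException in 1.3.
```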
Spark tests hang on local machine due to testGuavaOptional in JavaAPISuite
Hi all – building Spark on my local machine with build/mvn clean package, the tests run until they hit the JavaAPISuite, where they hang indefinitely. Through some experimentation, I’ve narrowed it down to the following test:

/**
 * Test for SPARK-3647. This test needs to use the maven-built assembly to trigger the issue,
 * since that's the only artifact where Guava classes have been relocated.
 */
@Test
public void testGuavaOptional() {
  // Stop the context created in setUp() and start a local-cluster one, to force usage of the
  // assembly.
  sc.stop();
  JavaSparkContext localCluster = new JavaSparkContext("local-cluster[1,1,512]", "JavaAPISuite");
  try {
    JavaRDD<Integer> rdd1 = localCluster.parallelize(Arrays.asList(1, 2, null), 3);
    JavaRDD<Optional<Integer>> rdd2 = rdd1.map(
      new Function<Integer, Optional<Integer>>() {
        @Override
        public Optional<Integer> call(Integer i) {
          return Optional.fromNullable(i);
        }
      });
    rdd2.collect();
  } finally {
    localCluster.stop();
  }
}

If I remove this test, things work smoothly. Has anyone else seen this? Thanks.
RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?
I am not sure Hive supports changing the metastore warehouse location after it has been initialized; I guess not. Spark SQL relies entirely on the Hive metastore in HiveContext, which is probably why it doesn't work as expected for Q1. BTW, in most cases people configure the metastore settings in hive-site.xml and never change them afterwards; is there any reason you want to change this at runtime?

For Q2, probably something is wrong in the configuration; it seems HDFS ran in pseudo/single-node mode. Can you double-check that? Or can you run DDL (like creating a table) from the spark shell with HiveContext?

From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?
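As the reply notes, the warehouse location is normally fixed in hive-site.xml before the metastore client is first initialized. A minimal fragment for reference (the path and hostname are illustrative, not from the thread):

```xml
<!-- conf/hive-site.xml: read once when the Hive metastore client initializes.
     Changing the value later via SET does not relocate an existing session's
     warehouse, which matches the behavior reported above. -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://server:8020/test/w</value>
  </property>
</configuration>
```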