[SPARK SQL] Difference between 'Hive on spark' and Spark SQL

2018-12-19 Thread luby
Hi, All,

We are starting to migrate our data to the Hadoop platform, hoping to use 
'Big Data' technologies to improve our business.

We are new in the area and want to get some help from you.

Currently all our data is put into Hive and some complicated SQL query 
statements are run daily.

We want to improve the performance of these queries and have two options 
at hand: 
a. turn on the 'Hive on Spark' feature and run our HQL queries, or
b. run those queries with Spark SQL.

What is the difference between these two options?

Another question:
There is a Hive setting 'hive.optimize.ppd' to enable the 'predicate pushdown' 
query optimization.
Is there an equivalent option in Spark SQL, or does the same setting also work 
for Spark SQL?
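For reference, a hedged sketch of the roughly equivalent switches on the Spark SQL 
side (these are per-format settings and are already enabled by default in recent 
Spark versions):

// Illustration only: Spark SQL pushes filters down per data source format.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")  // Parquet scans
spark.conf.set("spark.sql.orc.filterPushdown", "true")      // ORC scans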

Thanks in advance

Boying


 

Spark not working with Hadoop 4mc compression

2018-12-19 Thread Abhijeet Kumar
Hello,

I'm using 4mc compression in my Hadoop cluster, and when I read a file from 
HDFS, Spark throws an error.

https://github.com/carlomedas/4mc

I'm running a simple command:
sc.textFile("/store.csv").getNumPartitions

Error:
java.lang.RuntimeException: Error in configuring object
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
  at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
  at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:187)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.RDD.getNumPartitions(RDD.scala:267)
  ... 49 elided
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.IllegalArgumentException: Compression codec 
com.hadoop.compression.lzo.LzoCodec not found.
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
  ... 63 more
Caused by: java.lang.IllegalArgumentException: Compression codec 
com.hadoop.compression.lzo.LzoCodec not found.
  at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
  at 
org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:180)
  at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
  ... 68 more
Caused by: java.lang.ClassNotFoundException: Class 
com.hadoop.compression.lzo.LzoCodec not found
  at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
  at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
  ... 70 more
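The root cause at the bottom is a ClassNotFoundException for 
com.hadoop.compression.lzo.LzoCodec: the Hadoop configuration's io.compression.codecs 
list names that codec, but its jar is not on Spark's classpath. A hedged spark-shell 
sketch of one workaround follows; the codec list and the 4mc codec class name are 
assumptions, so adjust them to your core-site.xml:

// Keep only codecs whose jars are actually on the classpath. The hadoop-4mc jar
// itself still has to be supplied, e.g. via --jars or spark.jars.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("io.compression.codecs",
  "org.apache.hadoop.io.compress.DefaultCodec," +
  "org.apache.hadoop.io.compress.GzipCodec," +
  "com.hadoop.compression.fourmc.FourMcCodec")   // assumed 4mc codec class
sc.textFile("/store.csv").getNumPartitions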


Thank you,
Abhijeet Kumar

Re: Read Time from a remote data source

2018-12-19 Thread Jiaan Geng
First, a Spark worker does not do the computation itself; the executor is
responsible for computation.
Executors run the tasks distributed by the driver.
Each task normally reads only a section of the data, unless the stage has only
one partition.
If your operators do not include one that pulls intermediate results back from
each task, like collect or show, the driver will not store any data.
Executors do not keep the end result in memory by default, unless your pipeline
contains an operator that caches data in memory, like cache or persist.
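To make that concrete, a small spark-shell sketch (the HDFS path is a placeholder):

// Data is read lazily, partition by partition, on the executors; nothing is kept
// in memory or pulled to the driver unless an operator asks for it.
val logs = sc.textFile("hdfs:///placeholder/path")   // nothing is read yet
val errors = logs.filter(_.contains("ERROR"))
errors.cache()                                       // keep these partitions in executor memory
println(errors.count())                              // executors read and count; driver gets only the total
// errors.collect() would pull every matching row back to the driver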



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




[Spark SQL] use zstd, No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD

2018-12-19 Thread 李斌松
When the parquet-hadoop-bundle jar is pulled into the spark-hive project and you
compress data using zstd, the parquet classes may be loaded preferentially from
parquet-hadoop-bundle, and then the enum constant
parquet.hadoop.metadata.CompressionCodecName.ZSTD cannot be found.

>
> 18/12/20 10:35:28 ERROR Executor: Exception in task 0.2 in stage 1.0 (TID
> 5)
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.IllegalArgumentException: No enum constant
> parquet.hadoop.metadata.CompressionCodecName.ZSTD
> /parquet
> at
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:243)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:175)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:174)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:406)
> at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:412)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: No enum constant
> parquet.hadoop.metadata.CompressionCodecName.ZSTD
> at java.lang.Enum.valueOf(Enum.java:238)
> at
> parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:24)
> at
> parquet.hadoop.metadata.CompressionCodecName.fromConf(CompressionCodecName.java:34)
> at
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.initializeSerProperties(ParquetRecordWriterWrapper.java:94)
> at
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:61)
> at
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:125)
> at
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:114)
> at
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
> at
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
> ... 15 more
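For reference, a hedged repro sketch of the write path in the trace above. The table
name is a placeholder, and whether the write actually goes through Hive's
ParquetRecordWriterWrapper (and which copy of CompressionCodecName wins) depends on
the build and on settings such as spark.sql.hive.convertMetastoreParquet:

// Hypothetical repro: a Hive-serde parquet table that asks for zstd. The enum lookup
// in CompressionCodecName.fromConf fails if the older copy bundled inside
// parquet-hadoop-bundle (which predates ZSTD) is loaded first.
spark.sql("""
  CREATE TABLE default.zstd_repro (id BIGINT)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='ZSTD')
""")
spark.range(10).write.insertInto("default.zstd_repro")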


Fwd: Train multiple machine learning models in parallel

2018-12-19 Thread Pola Yao
Hi Community,

I have a 1T dataset which contains records for 50 users. Each user has about 20G
of data on average.

I want to use Spark to train a machine learning model (e.g., an XGBoost tree
model) for each user, so ideally the result should be 50 models. However, it
would be infeasible to submit 50 separate Spark jobs through 'spark-submit'.

The model parameters and feature engineering steps for each user's data would be
exactly the same, so I am wondering if there is a way to train these 50 models in
parallel?
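For what it's worth, one common pattern is to keep a single application and run the
per-user fits as concurrent Spark jobs; a hedged sketch below, where the
user_id/label/features column names are assumptions and Spark ML's GBTClassifier
stands in for XGBoost:

import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.sql.DataFrame
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Fit one model per user by submitting several Spark jobs concurrently from the
// same driver. Consider spark.scheduler.mode=FAIR so the jobs share the cluster.
def trainPerUser(data: DataFrame, userIds: Seq[String]): Map[String, GBTClassificationModel] = {
  val futures = userIds.map { uid =>
    Future {
      val subset = data.filter(data("user_id") === uid).cache()  // assumed column name
      val model = new GBTClassifier()
        .setLabelCol("label")        // assumed column name
        .setFeaturesCol("features")  // assumed assembled vector column
        .fit(subset)
      subset.unpersist()
      uid -> model
    }
  }
  Await.result(Future.sequence(futures), Duration.Inf).toMap
}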

Thanks!


Re: question about barrier execution mode in Spark 2.4.0

2018-12-19 Thread Xiangrui Meng
On Mon, Nov 12, 2018 at 7:33 AM Joe  wrote:

> Hello,
> I was reading Spark 2.4.0 release docs and I'd like to find out more
> about barrier execution mode.
> In particular I'd like to know what happens when number of partitions
> exceeds number of nodes (which I think is allowed, Spark tuning doc
> mentions that)?
>

The barrier execution mode is different: it needs to run the tasks for all
partitions together. So when the number of partitions is greater than the number
of nodes, it will wait until more nodes are available and print warning messages.


> Does Spark guarantee that all tasks process all partitions
> simultaneously?


They will start all together. We provide a barrier() method in the task context
to help with simple coordination among tasks.
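A minimal sketch of that API in spark-shell (Spark 2.4+), with placeholder
per-partition work:

import org.apache.spark.BarrierTaskContext

// All tasks of a barrier stage are scheduled together, and barrier() blocks
// until every task in the stage has reached it.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val result = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()        // wait for all tasks in this stage
  iter.map(_ * 2)      // placeholder per-partition work
}.collect()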


> If not then how does barrier mode handle partitions that
> are waiting to be processed?
> If there are partitions waiting to be processed then I don't think it's
> possible to send all data from given stage to a DL process, even when
> using barrier mode?
> Thanks a lot,
>
> Joe
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-19 Thread Mich Talebzadeh
Thanks Sam. Looks interesting. I will have a look in details and let you
know.

Best,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 19 Dec 2018 at 08:36, Sam Elamin  wrote:

> Hi Mich
>
> I wrote a connector to make it easier to connect Bigquery and Spark
>
> Have a look here https://github.com/samelamin/spark-bigquery/
>
> Your feedback is always welcome
>
> Kind Regards
> Sam
>
> On Tue, Dec 18, 2018 at 7:46 PM Mich Talebzadeh 
> wrote:
>
>> Thanks Jorn. I will try that. Requires installing sbt etc. on an ephemeral
>> compute server in Google Cloud to build an uber jar file.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 18 Dec 2018 at 11:16, Jörn Franke  wrote:
>>
>>> Maybe the guava version in your spark lib folder is not compatible (if
>>> your Spark version has a guava library)? In this case I propose to create a
>>> fat/uber jar, potentially with a shaded guava dependency.
>>>
>>> On 18.12.2018 at 11:26, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am writing a small test code in spark-shell with attached jar
>>> dependencies
>>>
>>> spark-shell --jars
>>> /home/hduser/jars/bigquery-connector-0.13.4-hadoop3.jar,/home/hduser/jars/gcs-connector-1.9.4-hadoop3.jar,/home/hduser/jars/other/guava-19.0.jar,/home/hduser/jars/google-api-client-1.4.1-beta.jar,/home/hduser/jars/google-api-client-json-1.2.3-alpha.jar,/home/hduser/jars/google-api-services-bigquery-v2-rev20181202-1.27.0.jar
>>>
>>>  to read an already existing table in Google BigQuery as follows:
>>>
>>> import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
>>> import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
>>> import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
>>> import
>>> com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
>>> import
>>> com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
>>> import com.google.gson.JsonObject
>>> import org.apache.hadoop.io.LongWritable
>>> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
>>> // Assumes you have a spark context (sc) -- running from spark-shell
>>> REPL.
>>> // Marked as transient since configuration is not Serializable. This
>>> should
>>> // only be necessary in spark-shell REPL.
>>> @transient
>>> val conf = sc.hadoopConfiguration
>>> // Input parameters.
>>> val fullyQualifiedInputTableId = "axial-glow-224522.accounts.ll_18740868"
>>> val projectId = conf.get("fs.gs.project.id")
>>> val bucket = conf.get("fs.gs.system.bucket")
>>> // Input configuration.
>>> conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
>>> conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
>>> BigQueryConfiguration.configureBigQueryInput(conf,
>>> fullyQualifiedInputTableId)
>>>
>>> The problem I have is that even after loading jars with spark-shell
>>> --jar
>>>
>>> I am getting the following error at the last line
>>>
>>> scala> BigQueryConfiguration.configureBigQueryInput(conf,
>>> fullyQualifiedInputTableId)
>>>
>>> java.lang.NoSuchMethodError:
>>> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
>>>   at
>>> com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(BigQueryStrings.java:68)
>>>   at
>>> com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration.configureBigQueryInput(BigQueryConfiguration.java:260)
>>>   ... 49 elided
>>>
>>> It says it cannot find method
>>>
>>> java.lang.NoSuchMethodError:
>>> com.google.common.base.Preconditions.checkArgument
>>>
>>> but I checked it and it is in the following jar file
>>>
>>> jar tvf guava-19.0.jar| grep common.base.Preconditions
>>>   5249 Wed Dec 09 15:58:14 UTC 2015
>>> com/google/common/base/Preconditions.class
>>>
>>> I have used different versions of the guava jar file but none works!
>>>
>>> The code is based on the following:
>>>
>>>
>
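A hedged build.sbt fragment along the lines Jörn suggests (assumes the sbt-assembly
plugin), relocating guava inside the uber jar so the connector's guava 19 calls
cannot resolve to the older guava shipped with the Spark distribution:

// sbt-assembly shade rule: rewrite com.google.common packages inside the fat jar.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shadedguava.com.google.common.@1").inAll
)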

Re: Read Time from a remote data source

2018-12-19 Thread swastik mittal
I am running a model where the workers should not have the data stored on them;
they are only for execution. The other cluster (it's just a single node) that I
am receiving data from is just acting as a file server, for which I could have
used any other mechanism such as NFS or FTP. I went with HDFS so that I would
not have to worry about partitioning the data, and it does not affect my
experiment. My question is: once its first task starts, does a Spark worker read
all the data before computation and then distribute it among the workers'
memory, or do the workers read it chunk by chunk and only store the end result
in memory to send back the final result?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




[no subject]

2018-12-19 Thread Daniel O' Shaughnessy
unsubscribe


[Spark Core] Support for parquet column indexes

2018-12-19 Thread Kamil Krzysztof Krynicki
Hello,

Recently there has been an addition to the Parquet format, namely column 
indexes.

See: 
https://stackoverflow.com/questions/26909543/index-in-parquet/40714337#40714337

Available since parquet encoder 1.11, parquet format 2.5.

It seems to improve the IO performance by an order of magnitude in certain 
scenarios, which is simply fantastic.

My questions are:
- are there any plans to include it in upcoming Spark releases? Could you 
direct me to an issue, if one exists?
- if not, could you suggest a way to at least write parquet files in the new 
format and worry about the optimized reads later? Would simply forcing the 
parquet dependencies to the said versions be enough (see the sketch below)?
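For the second question, a hedged guess at what forcing the dependencies could look
like in build.sbt; the version numbers are placeholders, and whether Spark's own
bundled parquet classes still win on the runtime classpath is exactly the open
question:

// Placeholder versions: this only pins build-time resolution; the parquet jars
// that ship with the Spark distribution may still take precedence at runtime.
dependencyOverrides += "org.apache.parquet" % "parquet-hadoop" % "1.11.0"
dependencyOverrides += "org.apache.parquet" % "parquet-column" % "1.11.0"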

Thank you!

Cheers,
Kamil Krynicki


Spark Kafka Streaming with Offset Gaps

2018-12-19 Thread Rishabh Pugalia
I have an app that uses Kafka Streaming to pull data from `input` topic and
push to `output` topic with `processing.guarantee=exactly_once`. Due to
`exactly_once` gaps (transaction markers) are created in Kafka. Let's call
this app `kafka-streamer`.

Now I've another app that listens to this output topic (actually they are
multiple topics with a Pattern/Regex) and processes the data using
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html.
Let's call this app `spark-streamer`.

Due to the gaps, the first thing that happens is that Spark streaming fails. To
fix this I enabled `spark.streaming.kafka.allowNonConsecutiveOffsets=true` in the
Spark config before creating the StreamingContext (a sketch of this setup appears
below, after the questions). Now let's look at the issues I faced when I started
`spark-streamer` (I also went through some of the spark-streaming-kafka code in
the limited amount of time I had):

1. Once `spark-streamer` starts, if there are unconsumed offsets present in
the topic partition, it does poll them but won't process them (create RDDs)
until some new message is pushed to the topic partition after the app is
started. Code:
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala#L160
- I can see that we poll the data, but I'm not sure where the code is that
processes it. In any case, when I run the app I'm pretty sure the data doesn't
get processed (though it does get polled in `compactedStart()`) until
`compactedNext()` is called.
2. In `compactedNext()`, if no data is polled within 120s (the default timeout),
an exception is thrown and my app literally crashes. Code:
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala#L178
- Why do we throw an exception and not keep polling, just like a normal
KafkaConsumer would?

Would be of great help if somebody can help me out with the 2 questions
listed above!
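For context, a minimal sketch of the consumer setup described above (broker address,
group id and topic pattern are placeholders):

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

// allowNonConsecutiveOffsets lets the direct stream tolerate the gaps that
// transaction markers leave in an exactly-once output topic.
val conf = new SparkConf()
  .setAppName("spark-streamer")
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",             // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streamer",                   // placeholder
  "isolation.level" -> "read_committed"             // skip aborted transactions
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("output.*"), kafkaParams))

stream.foreachRDD(rdd => println(s"batch records: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()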

-- 
Thanks and Best Regards,
Rishabh


Re: Spark 2.2.1 - Operation not allowed: alter table replace columns

2018-12-19 Thread Jiaan Geng
This SQL syntax is not supported at the moment. Please use ALTER TABLE ...
CHANGE COLUMN instead.
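For illustration only, with hypothetical table and column names (as far as I know,
in Spark 2.x this form can update a column's comment but not rename it or change
its type):

// Keeps the same column name and type and only updates the comment, which is
// the variant Spark's parser accepts.
spark.sql("ALTER TABLE my_db.my_table CHANGE COLUMN id id INT COMMENT 'primary key'")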



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Multiple sessions in one application?

2018-12-19 Thread Jean Georges Perrin
Hi there,

I was curious what use cases would drive the use of newSession() (as in 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#newSession--).

I understand that you get a cleaner slate, but why would you need it?
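For concreteness, a small spark-shell sketch of what that cleaner slate looks like
(assuming an existing `spark` session):

// newSession() shares the SparkContext and cached data with the parent session
// but gets its own SQL conf, temp views and registered functions, so the two
// "t" views below do not collide.
val spark2 = spark.newSession()
spark.range(5).createOrReplaceTempView("t")
spark2.range(50).createOrReplaceTempView("t")
spark.sql("SELECT count(*) FROM t").show()    // 5
spark2.sql("SELECT count(*) FROM t").show()   // 50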

Thanks,

jg




Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-19 Thread Sam Elamin
Hi Mich

I wrote a connector to make it easier to connect Bigquery and Spark

Have a look here https://github.com/samelamin/spark-bigquery/

Your feedback is always welcome

Kind Regards
Sam

On Tue, Dec 18, 2018 at 7:46 PM Mich Talebzadeh 
wrote:

> Thanks Jorn. I will try that. Requires installing sbt etc. on an ephemeral
> compute server in Google Cloud to build an uber jar file.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 18 Dec 2018 at 11:16, Jörn Franke  wrote:
>
>> Maybe the guava version in your spark lib folder is not compatible (if
>> your Spark version has a guava library)? In this case I propose to create a
>> fat/uber jar, potentially with a shaded guava dependency.
>>
>> On 18.12.2018 at 11:26, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> I am writing a small test code in spark-shell with attached jar
>> dependencies
>>
>> spark-shell --jars
>> /home/hduser/jars/bigquery-connector-0.13.4-hadoop3.jar,/home/hduser/jars/gcs-connector-1.9.4-hadoop3.jar,/home/hduser/jars/other/guava-19.0.jar,/home/hduser/jars/google-api-client-1.4.1-beta.jar,/home/hduser/jars/google-api-client-json-1.2.3-alpha.jar,/home/hduser/jars/google-api-services-bigquery-v2-rev20181202-1.27.0.jar
>>
>>  to read an already existing table in Google BigQuery as follows:
>>
>> import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
>> import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
>> import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
>> import
>> com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
>> import
>> com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
>> import com.google.gson.JsonObject
>> import org.apache.hadoop.io.LongWritable
>> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
>> // Assumes you have a spark context (sc) -- running from spark-shell REPL.
>> // Marked as transient since configuration is not Serializable. This
>> should
>> // only be necessary in spark-shell REPL.
>> @transient
>> val conf = sc.hadoopConfiguration
>> // Input parameters.
>> val fullyQualifiedInputTableId = "axial-glow-224522.accounts.ll_18740868"
>> val projectId = conf.get("fs.gs.project.id")
>> val bucket = conf.get("fs.gs.system.bucket")
>> // Input configuration.
>> conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
>> conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
>> BigQueryConfiguration.configureBigQueryInput(conf,
>> fullyQualifiedInputTableId)
>>
>> The problem I have is that even after loading jars with spark-shell --jar
>>
>> I am getting the following error at the last line
>>
>> scala> BigQueryConfiguration.configureBigQueryInput(conf,
>> fullyQualifiedInputTableId)
>>
>> java.lang.NoSuchMethodError:
>> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
>>   at
>> com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(BigQueryStrings.java:68)
>>   at
>> com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration.configureBigQueryInput(BigQueryConfiguration.java:260)
>>   ... 49 elided
>>
>> It says it cannot find method
>>
>> java.lang.NoSuchMethodError:
>> com.google.common.base.Preconditions.checkArgument
>>
>> but I checked it and it is in the following jar file
>>
>> jar tvf guava-19.0.jar| grep common.base.Preconditions
>>   5249 Wed Dec 09 15:58:14 UTC 2015
>> com/google/common/base/Preconditions.class
>>
>> I have used different versions of the guava jar file but none works!
>>
>> The code is based on the following:
>>
>>
>> https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
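A quick diagnostic that may help here (spark-shell, no extra dependencies): check
which jar the JVM actually loaded Preconditions from; if it is not guava-19.0.jar,
the NoSuchMethodError above is explained by an older guava earlier on the classpath.

// Prints the location of the jar that supplied com.google.common.base.Preconditions.
println(classOf[com.google.common.base.Preconditions]
  .getProtectionDomain.getCodeSource.getLocation)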