Re: Why spark-submit works with package not with jar

2020-10-20 Thread Wim Van Leuven
Sean,

Problem with --packages is that in enterprise settings, security might
not allow the data environment to link to the internet, or even to the
internal proxying artifact repository.

Also, weren't uber jars an antipattern? For some reason I don't like them...

Kind regards
-wim
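
For what it's worth, --packages does not strictly require public internet
access: resolution can be pointed at an internal mirror. A minimal sketch,
assuming a hypothetical internal repository URL and Ivy settings path
(spark-submit's --repositories flag and the spark.jars.ivySettings property
both exist for this purpose):

${SPARK_HOME}/bin/spark-submit \
  --repositories https://nexus.internal.example/repository/maven-public \
  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 \
  ...

# or, where proxies and credentials matter, hand Ivy a full settings file:
${SPARK_HOME}/bin/spark-submit \
  --conf spark.jars.ivySettings=/etc/spark/ivysettings.xml \
  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 \
  ...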



On Wed, 21 Oct 2020 at 01:06, Mich Talebzadeh 
wrote:

> Thanks again all.
>
> Anyway, as Nicolas suggested, I used the trench warfare approach to sort this
> out by just using jars and working out their dependencies in the ~/.ivy2/jars
> directory using grep -lRi  :)
>
>
> This now works with just using jars (the newly added ones in grey) after
> resolving the dependencies:
>
>
> ${SPARK_HOME}/bin/spark-submit \
> --master yarn \
> --deploy-mode client \
> --conf spark.executor.memoryOverhead=3000 \
> --class org.apache.spark.repl.Main \
> --name "my own Spark shell on Yarn" "$@" \
> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>        /home/hduser/jars/ddhybrid.jar, \
>        /home/hduser/jars/com.google.http-client_google-http-client-1.24.1.jar, \
>        /home/hduser/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar, \
>        /home/hduser/jars/com.google.cloud.bigdataoss_util-1.9.4.jar, \
>        /home/hduser/jars/com.google.api-client_google-api-client-1.24.1.jar, \
>        /home/hduser/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar, \
>        /home/hduser/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar, \
>        /home/hduser/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar, \
>        /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>
>
> Compared to using the package itself as before
>
>
> ${SPARK_HOME}/bin/spark-submit \
> --master yarn \
> --deploy-mode client \
> --conf spark.executor.memoryOverhead=3000 \
> --class org.apache.spark.repl.Main \
> --name "my own Spark shell on Yarn" "$@" \
> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>        /home/hduser/jars/ddhybrid.jar \
> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>
>
>
> I think, as Sean suggested, this approach may or may not work (it is a manual
> process), and if the jars change, the whole thing has to be re-evaluated,
> adding to the complexity.
>
>
> Cheers
>
>
> On Tue, 20 Oct 2020 at 23:01, Sean Owen  wrote:
>
>> Rather, let --packages (via Ivy) worry about them, because they tell Ivy
>> what they need.
>> There's no 100% guarantee that conflicting dependencies are resolved in a
>> way that works in every single case, which you run into sometimes when
>> using incompatible libraries, but yes this is the point of --packages and
>> Ivy.
>>
>> On Tue, Oct 20, 2020 at 4:43 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks again all.
>>>
>>> Hi Sean,
>>>
>>> As I understood from your statement, you are suggesting just use
>>> --packages without worrying about individual jar dependencies?
>>>

>>


Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread lec ssmi
Structured Streaming's bottom layer also uses a micro-batch
mechanism. It seems that the first batch is slower than the later ones; I also
often encounter this problem. It feels related to the division of batches.
On the other hand, Spark's batch size is usually bigger than Flume's
transaction batch size.
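
The "chasing the early Kafka data" point raised below maps to the Kafka
source's startingOffsets option. A minimal sketch (broker and topic names are
hypothetical), pasted into a spark-shell:

${SPARK_HOME}/bin/spark-shell <<'EOF'
val df = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "broker1:9092").
  option("subscribe", "mytopic").
  option("startingOffsets", "latest").
  load()
// "earliest" replays the whole topic history, which can make the first
// micro-batches look extremely slow; "latest" starts from now.
EOF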


KhajaAsmath Mohammed  wrote on Wed, 21 Oct 2020 at 12:19 PM:

> Yes. Changing back to latest worked, but I still see the slowness compared
> to Flume.
>
> Sent from my iPhone
>
> On Oct 20, 2020, at 10:21 PM, lec ssmi  wrote:
>
> 
> Do you start your application by chasing the early Kafka data?
>
> Lalwani, Jayesh  wrote on Wed, 21 Oct 2020 at 2:19 AM:
>
>> Are you getting any output? Streaming jobs typically run forever, and
>> keep processing data as it comes in the input. If a streaming job is
>> working well, it will typically generate output at a certain cadence
>>
>>
>>
>> *From: *KhajaAsmath Mohammed 
>> *Date: *Tuesday, October 20, 2020 at 1:23 PM
>> *To: *"user @spark" 
>> *Subject: *[EXTERNAL] Spark Structured streaming - Kafka - slowness with
>> query 0
>>
>>
>>
>> *CAUTION*: This email originated from outside of the organization. Do
>> not click links or open attachments unless you can confirm the sender and
>> know the content is safe.
>>
>>
>>
>> Hi,
>>
>>
>>
>> I have started using Spark Structured Streaming for reading data from
>> Kafka and the job is very slow. The number of output rows keeps increasing
>> in query 0 and the job is running forever. Any suggestions for this, please?
>>
>>
>>
>> 
>>
>>
>>
>> Thanks,
>>
>> Asmath
>>
>


Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread KhajaAsmath Mohammed
Yes. Changing back to latest worked, but I still see the slowness compared to
Flume.

Sent from my iPhone

> On Oct 20, 2020, at 10:21 PM, lec ssmi  wrote:
> 
> 
> Do you start your application by chasing the early Kafka data?
> 
> Lalwani, Jayesh  wrote on Wed, 21 Oct 2020 at 2:19 AM:
>> Are you getting any output? Streaming jobs typically run forever, and keep 
>> processing data as it comes in the input. If a streaming job is working 
>> well, it will typically generate output at a certain cadence
>> 
>>  
>> 
>> From: KhajaAsmath Mohammed 
>> Date: Tuesday, October 20, 2020 at 1:23 PM
>> To: "user @spark" 
>> Subject: [EXTERNAL] Spark Structured streaming - Kafka - slowness with query 0
>> 
>>  
>> 
>> CAUTION: This email originated from outside of the organization. Do not 
>> click links or open attachments unless you can confirm the sender and know 
>> the content is safe.
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> I have started using Spark Structured Streaming for reading data from Kafka
>> and the job is very slow. The number of output rows keeps increasing in query 0
>> and the job is running forever. Any suggestions for this, please?
>> 
>>  
>> 
>> 
>> 
>>  
>> 
>> Thanks,
>> 
>> Asmath


Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread lec ssmi
Do you start your application by chasing the early Kafka data?

Lalwani, Jayesh  wrote on Wed, 21 Oct 2020 at 2:19 AM:

> Are you getting any output? Streaming jobs typically run forever, and keep
> processing data as it comes in the input. If a streaming job is working
> well, it will typically generate output at a certain cadence
>
>
>
> *From: *KhajaAsmath Mohammed 
> *Date: *Tuesday, October 20, 2020 at 1:23 PM
> *To: *"user @spark" 
> *Subject: *[EXTERNAL] Spark Structured streaming - Kafka - slowness with
> query 0
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> Hi,
>
>
>
> I have started using Spark Structured Streaming for reading data from Kafka
> and the job is very slow. The number of output rows keeps increasing in query 0
> and the job is running forever. Any suggestions for this, please?
>
>
>
>
>
> Thanks,
>
> Asmath
>


【The decimal result is incorrectly enlarged by 100 times】

2020-10-20 Thread 王长春
Hi,
I have come across a problem with the correctness of Spark decimals, and I have
researched it for a few days. This problem is very curious.

My Spark version is 2.3.1.

I have a SQL statement like this:
Create table table_S stored as orc as
Select a*b*c from table_a
Union all
Select d from table_B
Union all
Select e from table_C

Columns a, b and c are all decimal(38,4).
Column d is also decimal(38,4).
Column e is also decimal(38,4).

The result of this SQL is wrong: it is 100 times greater than the
correct value.

The weird thing is: if I delete the "create table" clause, the result is
correct. And if I change the order of the union, the result is also correct.
E.g.

Create table table_S stored as orc as
Select d from table_B
Union all
Select a*b*c from table_a
Union all
Select e from table_C


Besides, Spark 2.3.2 gives the correct result in this case. But I checked all
the patches in 2.3.2 and cannot find which one solves this problem.


Can anyone give some help? Has anyone encountered the same problem?

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Thanks again all.

Anyway, as Nicolas suggested, I used the trench warfare approach to sort this
out by just using jars and working out their dependencies in the ~/.ivy2/jars
directory using grep -lRi  :)


This now works with just using jars (the newly added ones in grey) after
resolving the dependencies:


${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.executor.memoryOverhead=3000 \
--class org.apache.spark.repl.Main \
--name "my own Spark shell on Yarn" "$@" \
--driver-class-path /home/hduser/jars/ddhybrid.jar \
--jars /home/hduser/jars/spark-bigquery-latest.jar, \
       /home/hduser/jars/ddhybrid.jar, \
       /home/hduser/jars/com.google.http-client_google-http-client-1.24.1.jar, \
       /home/hduser/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar, \
       /home/hduser/jars/com.google.cloud.bigdataoss_util-1.9.4.jar, \
       /home/hduser/jars/com.google.api-client_google-api-client-1.24.1.jar, \
       /home/hduser/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar, \
       /home/hduser/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar, \
       /home/hduser/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar, \
       /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar


Compared to using the package itself as before


${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.executor.memoryOverhead=3000 \
--class org.apache.spark.repl.Main \
--name "my own Spark shell on Yarn" "$@" \
--driver-class-path /home/hduser/jars/ddhybrid.jar \
--jars /home/hduser/jars/spark-bigquery-latest.jar, \
       /home/hduser/jars/ddhybrid.jar \
--packages com.github.samelamin:spark-bigquery_2.11:0.2.6



I think, as Sean suggested, this approach may or may not work (it is a manual
process), and if the jars change, the whole thing has to be re-evaluated,
adding to the complexity.


Cheers


On Tue, 20 Oct 2020 at 23:01, Sean Owen  wrote:

> Rather, let --packages (via Ivy) worry about them, because they tell Ivy
> what they need.
> There's no 100% guarantee that conflicting dependencies are resolved in a
> way that works in every single case, which you run into sometimes when
> using incompatible libraries, but yes this is the point of --packages and
> Ivy.
>
> On Tue, Oct 20, 2020 at 4:43 PM Mich Talebzadeh 
> wrote:
>
>> Thanks again all.
>>
>> Hi Sean,
>>
>> As I understood from your statement, you are suggesting just use
>> --packages without worrying about individual jar dependencies?
>>
>>>
>


Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
Rather, let --packages (via Ivy) worry about them, because they tell Ivy
what they need.
There's no 100% guarantee that conflicting dependencies are resolved in a
way that works in every single case, which you run into sometimes when
using incompatible libraries, but yes this is the point of --packages and
Ivy.

On Tue, Oct 20, 2020 at 4:43 PM Mich Talebzadeh 
wrote:

> Thanks again all.
>
> Hi Sean,
>
> As I understood from your statement, you are suggesting just use
> --packages without worrying about individual jar dependencies?
>
>>



Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Or just use mvn or sbt to create an uber jar file.
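
A minimal sketch of that route, assuming the sbt-assembly plugin is already
configured in the build; the class and artifact names here are hypothetical:

# build one fat jar carrying the app plus all (merged) dependencies,
# then ship only that jar
sbt assembly
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  target/scala-2.11/myapp-assembly-0.1.0.jar

Conflicts such as the Guava versions mentioned earlier are then settled once,
at build time, via the plugin's merge strategies, rather than at submit time.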




LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Oct 2020 at 22:43, Mich Talebzadeh 
wrote:

> Thanks again all.
>
> Hi Sean,
>
> As I understood from your statement, you are suggesting just use
> --packages without worrying about individual jar dependencies?
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Oct 2020 at 22:34, Sean Owen  wrote:
>
>> From the looks of it, it's the com.google.http-client ones. But there may
>> be more. You should not have to reason about this. That's why you let Maven
>> / Ivy resolution figure it out. It is not true that everything in .ivy2 is
>> on the classpath.
>>
>> On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Nicolas,
>>>
>>> I removed ~/.ivy2 and reran the Spark job with the package included (the
>>> one working).
>>>
>>> Under ~/.ivy2/jars I have 37 jar files, including the one that I had
>>> before.
>>>
>>> /home/hduser/.ivy2/jars> ls
>>> com.databricks_spark-avro_2.11-4.0.0.jar
>>>  com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar
>>> com.google.oauth-client_google-oauth-client-1.24.1.jar
>>> org.checkerframework_checker-qual-2.5.2.jar
>>> com.fasterxml.jackson.core_jackson-core-2.9.2.jar
>>> com.google.cloud.bigdataoss_gcsio-1.9.4.jar
>>> com.google.oauth-client_google-oauth-client-java6-1.24.1.jar
>>> org.codehaus.jackson_jackson-core-asl-1.9.13.jar
>>> com.github.samelamin_spark-bigquery_2.11-0.2.6.jar
>>>  com.google.cloud.bigdataoss_util-1.9.4.jar
>>>  commons-codec_commons-codec-1.6.jar
>>>  org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
>>> com.google.api-client_google-api-client-1.24.1.jar
>>>  com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar
>>> commons-logging_commons-logging-1.1.1.jar
>>>  org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
>>> com.google.api-client_google-api-client-jackson2-1.24.1.jar
>>> com.google.code.findbugs_jsr305-3.0.2.jar
>>> com.thoughtworks.paranamer_paranamer-2.3.jar
>>> org.slf4j_slf4j-api-1.7.5.jar
>>> com.google.api-client_google-api-client-java6-1.24.1.jar
>>>  com.google.errorprone_error_prone_annotations-2.1.3.jar
>>> joda-time_joda-time-2.9.3.jar
>>>  org.tukaani_xz-1.0.jar
>>> com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar
>>> com.google.guava_guava-26.0-jre.jar
>>> org.apache.avro_avro-1.7.6.jar
>>> org.xerial.snappy_snappy-java-1.0.5.jar
>>> com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar
>>>  com.google.http-client_google-http-client-1.24.1.jar
>>>  org.apache.commons_commons-compress-1.4.1.jar
>>> com.google.auto.value_auto-value-annotations-1.6.2.jar
>>>  com.google.http-client_google-http-client-jackson2-1.24.1.jar
>>> org.apache.httpcomponents_httpclient-4.0.1.jar
>>> com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar
>>> com.google.j2objc_j2objc-annotations-1.1.jar
>>>  org.apache.httpcomponents_httpcore-4.0.1.jar
>>>
>>> I don't think I need to add all of these to the spark-submit --jars list.
>>> Is there a way I can find out which dependency is missing?
>>>
>>> This is the error I am getting when I use the jar file
>>> * com.github.samelamin_spark-bigquery_2.11-0.2.6.jar* instead of the
>>> package *com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>>
>>> java.lang.NoClassDefFoundError:
>>> com/google/api/client/http/HttpRequestInitializer
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>>   ... 76 elided
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.google.api.client.http.HttpRequestInitializer
>>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>>
>>> Thanks
>>

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Thanks again all.

Hi Sean,

As I understood from your statement, you are suggesting just use --packages
without worrying about individual jar dependencies?



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Oct 2020 at 22:34, Sean Owen  wrote:

> From the looks of it, it's the com.google.http-client ones. But there may
> be more. You should not have to reason about this. That's why you let Maven
> / Ivy resolution figure it out. It is not true that everything in .ivy2 is
> on the classpath.
>
> On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh 
> wrote:
>
>> Hi Nicolas,
>>
>> I removed ~/.ivy2 and reran the Spark job with the package included (the
>> one working).
>>
>> Under ~/.ivy2/jars I have 37 jar files, including the one that I had
>> before.
>>
>> /home/hduser/.ivy2/jars> ls
>> com.databricks_spark-avro_2.11-4.0.0.jar
>>  com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar
>> com.google.oauth-client_google-oauth-client-1.24.1.jar
>> org.checkerframework_checker-qual-2.5.2.jar
>> com.fasterxml.jackson.core_jackson-core-2.9.2.jar
>> com.google.cloud.bigdataoss_gcsio-1.9.4.jar
>> com.google.oauth-client_google-oauth-client-java6-1.24.1.jar
>> org.codehaus.jackson_jackson-core-asl-1.9.13.jar
>> com.github.samelamin_spark-bigquery_2.11-0.2.6.jar
>>  com.google.cloud.bigdataoss_util-1.9.4.jar
>>  commons-codec_commons-codec-1.6.jar
>>  org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
>> com.google.api-client_google-api-client-1.24.1.jar
>>  com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar
>> commons-logging_commons-logging-1.1.1.jar
>>  org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
>> com.google.api-client_google-api-client-jackson2-1.24.1.jar
>> com.google.code.findbugs_jsr305-3.0.2.jar
>> com.thoughtworks.paranamer_paranamer-2.3.jar
>> org.slf4j_slf4j-api-1.7.5.jar
>> com.google.api-client_google-api-client-java6-1.24.1.jar
>>  com.google.errorprone_error_prone_annotations-2.1.3.jar
>> joda-time_joda-time-2.9.3.jar
>>  org.tukaani_xz-1.0.jar
>> com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar
>> com.google.guava_guava-26.0-jre.jar
>> org.apache.avro_avro-1.7.6.jar
>> org.xerial.snappy_snappy-java-1.0.5.jar
>> com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar
>>  com.google.http-client_google-http-client-1.24.1.jar
>>  org.apache.commons_commons-compress-1.4.1.jar
>> com.google.auto.value_auto-value-annotations-1.6.2.jar
>>  com.google.http-client_google-http-client-jackson2-1.24.1.jar
>> org.apache.httpcomponents_httpclient-4.0.1.jar
>> com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar
>> com.google.j2objc_j2objc-annotations-1.1.jar
>>  org.apache.httpcomponents_httpcore-4.0.1.jar
>>
>> I don't think I need to add all of these to the spark-submit --jars list.
>> Is there a way I can find out which dependency is missing?
>>
>> This is the error I am getting when I use the jar file
>> * com.github.samelamin_spark-bigquery_2.11-0.2.6.jar* instead of the
>> package *com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>
>> java.lang.NoClassDefFoundError:
>> com/google/api/client/http/HttpRequestInitializer
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>   ... 76 elided
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.api.client.http.HttpRequestInitializer
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>>
>> Thanks
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 20 Oct 2020 at 20:09, Nicolas Paris 
>> wrote:
>>
>>> Once you get the jars from --packages into the ~/.ivy2 folder, you can then
>>> add the list to --jars. In this way there is no missing dependency.
>>>
>>>
>>> ayan guha  writes:
>>>
>>> > Hi
>>> >
>>> > One way to think of this is --packages is better when you have third
>>> party
>>> > dependency and --jars is better when you have custom in-hou

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
From the looks of it, it's the com.google.http-client ones. But there may
be more. You should not have to reason about this. That's why you let Maven
/ Ivy resolution figure it out. It is not true that everything in .ivy2 is
on the classpath.

On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh 
wrote:

> Hi Nicolas,
>
> I removed ~/.ivy2 and reran the Spark job with the package included (the
> one working).
>
> Under ~/.ivy2/jars I have 37 jar files, including the one that I had
> before.
>
> /home/hduser/.ivy2/jars> ls
> com.databricks_spark-avro_2.11-4.0.0.jar
>  com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar
> com.google.oauth-client_google-oauth-client-1.24.1.jar
> org.checkerframework_checker-qual-2.5.2.jar
> com.fasterxml.jackson.core_jackson-core-2.9.2.jar
> com.google.cloud.bigdataoss_gcsio-1.9.4.jar
> com.google.oauth-client_google-oauth-client-java6-1.24.1.jar
> org.codehaus.jackson_jackson-core-asl-1.9.13.jar
> com.github.samelamin_spark-bigquery_2.11-0.2.6.jar
>  com.google.cloud.bigdataoss_util-1.9.4.jar
>  commons-codec_commons-codec-1.6.jar
>  org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
> com.google.api-client_google-api-client-1.24.1.jar
>  com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar
> commons-logging_commons-logging-1.1.1.jar
>  org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
> com.google.api-client_google-api-client-jackson2-1.24.1.jar
> com.google.code.findbugs_jsr305-3.0.2.jar
> com.thoughtworks.paranamer_paranamer-2.3.jar
> org.slf4j_slf4j-api-1.7.5.jar
> com.google.api-client_google-api-client-java6-1.24.1.jar
>  com.google.errorprone_error_prone_annotations-2.1.3.jar
> joda-time_joda-time-2.9.3.jar
>  org.tukaani_xz-1.0.jar
> com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar
> com.google.guava_guava-26.0-jre.jar
> org.apache.avro_avro-1.7.6.jar
> org.xerial.snappy_snappy-java-1.0.5.jar
> com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar
>  com.google.http-client_google-http-client-1.24.1.jar
>  org.apache.commons_commons-compress-1.4.1.jar
> com.google.auto.value_auto-value-annotations-1.6.2.jar
>  com.google.http-client_google-http-client-jackson2-1.24.1.jar
> org.apache.httpcomponents_httpclient-4.0.1.jar
> com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar
> com.google.j2objc_j2objc-annotations-1.1.jar
>  org.apache.httpcomponents_httpcore-4.0.1.jar
>
> I don't think I need to add all of these to the spark-submit --jars list.
> Is there a way I can find out which dependency is missing?
>
> This is the error I am getting when I use the jar file
> * com.github.samelamin_spark-bigquery_2.11-0.2.6.jar* instead of the
> package *com.github.samelamin:spark-bigquery_2.11:0.2.6*
>
> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>   ... 76 elided
> Caused by: java.lang.ClassNotFoundException:
> com.google.api.client.http.HttpRequestInitializer
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
> Thanks
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Oct 2020 at 20:09, Nicolas Paris 
> wrote:
>
>> Once you get the jars from --packages into the ~/.ivy2 folder, you can then
>> add the list to --jars. In this way there is no missing dependency.
>>
>>
>> ayan guha  writes:
>>
>> > Hi
>> >
>> > One way to think of this is --packages is better when you have third
>> party
>> > dependency and --jars is better when you have custom in-house built
>> jars.
>> >
>> > On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh <
>> mich.talebza...@gmail.com>
>> > wrote:
>> >
>> >> Thanks Sean and Russell. Much appreciated.
>> >>
>> >> Just to clarify recently I had issues with different versions of Google
>> >> Guava jar files in building Uber jar file (to evict the unwanted ones).
>> >> These used to work a year and half ago using Google Dataproc compute
>> >> engines (comes with Spark preloaded) and I could create an Uber jar
>> file.
>> >>
>> >> Unfortunately this has become problematic now so tried to use
>> spark-submit
>> >> instead as follows:
>> >>
>> >> ${SPARK_HOME}/bin/spark-submit \
>> >> --master yarn \
>> >> --deploy-mode client \
>> >> --conf spark.executor.memoryOverhead=3000 \
>> >> --class o

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Nicolas Paris
you can proceed step by step.

> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer

I would run `grep -lRi HttpRequestInitializer` in the ivy2 folder to
spot the jar containing that class. After several other rounds of
class-not-found you should succeed.
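
An equivalent check that lists jar entries instead of relying on binary grep
(assumes unzip is available):

# print each cached jar that contains the missing class
for j in ~/.ivy2/jars/*.jar; do
  unzip -l "$j" | grep -q 'api/client/http/HttpRequestInitializer' && echo "$j"
done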

Mich Talebzadeh  writes:

> Hi Nicolas,
>
> I removed ~/.ivy2 and reran the Spark job with the package included (the one
> working).
>
> Under ~/.ivy2/jars I have 37 jar files, including the one that I had before.
>
> /home/hduser/.ivy2/jars> ls
> com.databricks_spark-avro_2.11-4.0.0.jar
>  com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar
> com.google.oauth-client_google-oauth-client-1.24.1.jar
> org.checkerframework_checker-qual-2.5.2.jar
> com.fasterxml.jackson.core_jackson-core-2.9.2.jar
> com.google.cloud.bigdataoss_gcsio-1.9.4.jar
> com.google.oauth-client_google-oauth-client-java6-1.24.1.jar
> org.codehaus.jackson_jackson-core-asl-1.9.13.jar
> com.github.samelamin_spark-bigquery_2.11-0.2.6.jar
>  com.google.cloud.bigdataoss_util-1.9.4.jar
>  commons-codec_commons-codec-1.6.jar
>  org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
> com.google.api-client_google-api-client-1.24.1.jar
>  com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar
> commons-logging_commons-logging-1.1.1.jar
>  org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
> com.google.api-client_google-api-client-jackson2-1.24.1.jar
> com.google.code.findbugs_jsr305-3.0.2.jar
> com.thoughtworks.paranamer_paranamer-2.3.jar
> org.slf4j_slf4j-api-1.7.5.jar
> com.google.api-client_google-api-client-java6-1.24.1.jar
>  com.google.errorprone_error_prone_annotations-2.1.3.jar
> joda-time_joda-time-2.9.3.jar
>  org.tukaani_xz-1.0.jar
> com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar
> com.google.guava_guava-26.0-jre.jar
> org.apache.avro_avro-1.7.6.jar
> org.xerial.snappy_snappy-java-1.0.5.jar
> com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar
>  com.google.http-client_google-http-client-1.24.1.jar
>  org.apache.commons_commons-compress-1.4.1.jar
> com.google.auto.value_auto-value-annotations-1.6.2.jar
>  com.google.http-client_google-http-client-jackson2-1.24.1.jar
> org.apache.httpcomponents_httpclient-4.0.1.jar
> com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar
> com.google.j2objc_j2objc-annotations-1.1.jar
>  org.apache.httpcomponents_httpcore-4.0.1.jar
>
> I don't think I need to add all of these to the spark-submit --jars list.
> Is there a way I can find out which dependency is missing?
>
> This is the error I am getting when I use the jar file
> * com.github.samelamin_spark-bigquery_2.11-0.2.6.jar* instead of the
> package *com.github.samelamin:spark-bigquery_2.11:0.2.6*
>
> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>   ... 76 elided
> Caused by: java.lang.ClassNotFoundException:
> com.google.api.client.http.HttpRequestInitializer
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
> Thanks
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Oct 2020 at 20:09, Nicolas Paris 
> wrote:
>
>> Once you get the jars from --packages into the ~/.ivy2 folder, you can then
>> add the list to --jars. In this way there is no missing dependency.
>>
>>
>> ayan guha  writes:
>>
>> > Hi
>> >
>> > One way to think of this is --packages is better when you have third
>> party
>> > dependency and --jars is better when you have custom in-house built jars.
>> >
>> > On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh <
>> mich.talebza...@gmail.com>
>> > wrote:
>> >
>> >> Thanks Sean and Russell. Much appreciated.
>> >>
>> >> Just to clarify recently I had issues with different versions of Google
>> >> Guava jar files in building Uber jar file (to evict the unwanted ones).
>> >> These used to work a year and half ago using Google Dataproc compute
>> >> engines (comes with Spark preloaded) and I could create an Uber jar
>> file.
>> >>
>> >> Unfortunately this has become problematic now so tried to use
>> spark-submit
>> >> instead as follows:
>> >>
>> >> ${SPARK_HOME}/bin/spark-submit \
>> >> --master yarn \
>> >> --deploy-mode client \
>> >> --conf spark.executor.memoryOverhead=3000 \
>> >> --class org

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Hi Nicolas,

I removed ~/.ivy2 and reran the Spark job with the package included (the one
working).

Under ~/.ivy2/jars I have 37 jar files, including the one that I had before.

/home/hduser/.ivy2/jars> ls
com.databricks_spark-avro_2.11-4.0.0.jar
 com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar
com.google.oauth-client_google-oauth-client-1.24.1.jar
org.checkerframework_checker-qual-2.5.2.jar
com.fasterxml.jackson.core_jackson-core-2.9.2.jar
com.google.cloud.bigdataoss_gcsio-1.9.4.jar
com.google.oauth-client_google-oauth-client-java6-1.24.1.jar
org.codehaus.jackson_jackson-core-asl-1.9.13.jar
com.github.samelamin_spark-bigquery_2.11-0.2.6.jar
 com.google.cloud.bigdataoss_util-1.9.4.jar
 commons-codec_commons-codec-1.6.jar
 org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
com.google.api-client_google-api-client-1.24.1.jar
 com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar
commons-logging_commons-logging-1.1.1.jar
 org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
com.google.api-client_google-api-client-jackson2-1.24.1.jar
com.google.code.findbugs_jsr305-3.0.2.jar
com.thoughtworks.paranamer_paranamer-2.3.jar
org.slf4j_slf4j-api-1.7.5.jar
com.google.api-client_google-api-client-java6-1.24.1.jar
 com.google.errorprone_error_prone_annotations-2.1.3.jar
joda-time_joda-time-2.9.3.jar
 org.tukaani_xz-1.0.jar
com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar
com.google.guava_guava-26.0-jre.jar
org.apache.avro_avro-1.7.6.jar
org.xerial.snappy_snappy-java-1.0.5.jar
com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar
 com.google.http-client_google-http-client-1.24.1.jar
 org.apache.commons_commons-compress-1.4.1.jar
com.google.auto.value_auto-value-annotations-1.6.2.jar
 com.google.http-client_google-http-client-jackson2-1.24.1.jar
org.apache.httpcomponents_httpclient-4.0.1.jar
com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar
com.google.j2objc_j2objc-annotations-1.1.jar
 org.apache.httpcomponents_httpcore-4.0.1.jar

I don't think I need to add all of these to the spark-submit --jars list.
Is there a way I can find out which dependency is missing?

This is the error I am getting when I use the jar file
* com.github.samelamin_spark-bigquery_2.11-0.2.6.jar* instead of the
package *com.github.samelamin:spark-bigquery_2.11:0.2.6*

java.lang.NoClassDefFoundError:
com/google/api/client/http/HttpRequestInitializer
  at
com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
  at
com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
  at
com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
  ... 76 elided
Caused by: java.lang.ClassNotFoundException:
com.google.api.client.http.HttpRequestInitializer
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)


Thanks



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Oct 2020 at 20:09, Nicolas Paris 
wrote:

> Once you get the jars from --packages into the ~/.ivy2 folder, you can then
> add the list to --jars. In this way there is no missing dependency.
>
>
> ayan guha  writes:
>
> > Hi
> >
> > One way to think of this is --packages is better when you have third
> party
> > dependency and --jars is better when you have custom in-house built jars.
> >
> > On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> > wrote:
> >
> >> Thanks Sean and Russell. Much appreciated.
> >>
> >> Just to clarify recently I had issues with different versions of Google
> >> Guava jar files in building Uber jar file (to evict the unwanted ones).
> >> These used to work a year and half ago using Google Dataproc compute
> >> engines (comes with Spark preloaded) and I could create an Uber jar
> file.
> >>
> >> Unfortunately this has become problematic now so tried to use
> spark-submit
> >> instead as follows:
> >>
> >> ${SPARK_HOME}/bin/spark-submit \
> >> --master yarn \
> >> --deploy-mode client \
> >> --conf spark.executor.memoryOverhead=3000 \
> >> --class org.apache.spark.repl.Main \
> >> --name "Spark shell on Yarn" "$@" \
> >> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> >> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
> >>        /home/hduser/jars/ddhybrid.jar \
> >> --packages
> com.github.samelamin:spark-bigquery_2.11:0.2.6
> >>
> >> Effectively tailored spark-shell. However, I do not think there is a
> >> mechanism to resolve jar conflicts w

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Nicolas Paris
Once you get the jars from --packages into the ~/.ivy2 folder, you can then
add the list to --jars. In this way there is no missing dependency.
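
A small sketch of that idea, with the paths quoted elsewhere in this thread
(the paste invocation assumes GNU coreutils):

# run once with --packages to populate the cache, then reuse the cached jars
IVY_JARS=$(ls /home/hduser/.ivy2/jars/*.jar | paste -sd, -)
${SPARK_HOME}/bin/spark-submit --jars "$IVY_JARS" ...

As Sean notes elsewhere in the thread, this hands Spark the whole cache rather
than a curated classpath, so treat it as a stop-gap.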


ayan guha  writes:

> Hi
>
> One way to think of this is --packages is better when you have third party
> dependency and --jars is better when you have custom in-house built jars.
>
> On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh 
> wrote:
>
>> Thanks Sean and Russell. Much appreciated.
>>
>> Just to clarify recently I had issues with different versions of Google
>> Guava jar files in building Uber jar file (to evict the unwanted ones).
>> These used to work a year and half ago using Google Dataproc compute
>> engines (comes with Spark preloaded) and I could create an Uber jar file.
>>
>> Unfortunately this has become problematic now so tried to use spark-submit
>> instead as follows:
>>
>> ${SPARK_HOME}/bin/spark-submit \
>> --master yarn \
>> --deploy-mode client \
>> --conf spark.executor.memoryOverhead=3000 \
>> --class org.apache.spark.repl.Main \
>> --name "Spark shell on Yarn" "$@" \
>> --driver-class-path /home/hduser/jars/ddhybrid.jar \
>> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>>        /home/hduser/jars/ddhybrid.jar \
>> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>>
>> Effectively a tailored spark-shell. However, I do not think there is a
>> mechanism to resolve jar conflicts without building an uber jar file
>> through SBT?
>>
>> Cheers
>>
>>
>>
>> On Tue, 20 Oct 2020 at 16:54, Russell Spitzer 
>> wrote:
>>
>>> --jars adds only that jar.
>>> --packages adds the jar and its dependencies listed in Maven.
>>>
>>> On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 I have a scenario that I use in Spark submit as follows:

 spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
 /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
 */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*

 As you can see the jar files needed are added.


 This comes back with error message as below


 Creating model test.weights_MODEL

 java.lang.NoClassDefFoundError:
 com/google/api/client/http/HttpRequestInitializer

   at
 com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)

   at
 com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)

   at
 com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)

   ... 76 elided

 Caused by: java.lang.ClassNotFoundException:
 com.google.api.client.http.HttpRequestInitializer

   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)



 So there is an issue with finding the class, although the jar file used


 /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar

 has it.


 Now if *I remove the above jar file and replace it with the same version
 as a package*, it works!


 spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
 /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
 *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*


 I have read the write-ups about packages searching the maven
 libraries etc. Not convinced why using the package should make so much
 difference between a failure and success. In other words, when to use a
 package rather than a jar.


 Any ideas will be appreciated.


 Thanks



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



>>> --
> Best Regards,
> Ayan Guha


-- 
nicolas paris

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why spark-submit works with package not with jar

2020-10-20 Thread ayan guha
Hi

One way to think of this is --packages is better when you have third party
dependency and --jars is better when you have custom in-house built jars.

On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh 
wrote:

> Thanks Sean and Russell. Much appreciated.
>
> Just to clarify recently I had issues with different versions of Google
> Guava jar files in building Uber jar file (to evict the unwanted ones).
> These used to work a year and half ago using Google Dataproc compute
> engines (comes with Spark preloaded) and I could create an Uber jar file.
>
> Unfortunately this has become problematic now so tried to use spark-submit
> instead as follows:
>
> ${SPARK_HOME}/bin/spark-submit \
> --master yarn \
> --deploy-mode client \
> --conf spark.executor.memoryOverhead=3000 \
> --class org.apache.spark.repl.Main \
> --name "Spark shell on Yarn" "$@" \
> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>        /home/hduser/jars/ddhybrid.jar \
> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>
> Effectively a tailored spark-shell. However, I do not think there is a
> mechanism to resolve jar conflicts without building an uber jar file
> through SBT?
>
> Cheers
>
>
>
> On Tue, 20 Oct 2020 at 16:54, Russell Spitzer 
> wrote:
>
>> --jars adds only that jar.
>> --packages adds the jar and its dependencies listed in Maven.
>>
>> On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a scenario that I use in Spark submit as follows:
>>>
>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>>> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>>>
>>> As you can see the jar files needed are added.
>>>
>>>
>>> This comes back with error message as below
>>>
>>>
>>> Creating model test.weights_MODEL
>>>
>>> java.lang.NoClassDefFoundError:
>>> com/google/api/client/http/HttpRequestInitializer
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>>
>>>   ... 76 elided
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.google.api.client.http.HttpRequestInitializer
>>>
>>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>>
>>>
>>> So there is an issue with finding the class, although the jar file used
>>>
>>>
>>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>>
>>> has it.
>>>
>>>
>>> Now if *I remove the above jar file and replace it with the same version
>>> as a package*, it works!
>>>
>>>
>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>>> *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>>
>>>
>>> I have read the write-ups about packages searching the maven
>>> libraries etc. Not convinced why using the package should make so much
>>> difference between a failure and success. In other words, when to use a
>>> package rather than a jar.
>>>
>>>
>>> Any ideas will be appreciated.
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>> --
Best Regards,
Ayan Guha


Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread Lalwani, Jayesh
Are you getting any output? Streaming jobs typically run forever, and keep 
processing data as it comes in the input. If a streaming job is working well, 
it will typically generate output at a certain cadence

From: KhajaAsmath Mohammed 
Date: Tuesday, October 20, 2020 at 1:23 PM
To: "user @spark" 
Subject: [EXTERNAL] Spark Structured streaming - Kafka - slowness with query 0


CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.


Hi,

I have started using Spark Structured Streaming for reading data from Kafka and
the job is very slow. The number of output rows keeps increasing in query 0 and
the job is running forever. Any suggestions for this, please?

[cid:image001.png@01D6A6EA.F513EC50]

Thanks,
Asmath


Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread KhajaAsmath Mohammed
Hi,

I have started using Spark Structured Streaming for reading data from Kafka
and the job is very slow. The number of output rows keeps increasing in query 0
and the job is running forever. Any suggestions for this, please?

[image: image.png]

Thanks,
Asmath


Pyspark Framework for Apache Atlas (especially Tagging)

2020-10-20 Thread Dennis Suhari
Hi Spark Community, does somebody know a PySpark framework that integrates
with Apache Atlas? I want to trigger tagging etc. through my PySpark DataFrame
operations. Atlas has an API which I could use, so I could write my own
framework. But before I do this I wanted to ask whether something already
exists in the open source community.

Br,

Dennis

Sent from my iPhone
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Thanks Sean and Russell. Much appreciated.

Just to clarify recently I had issues with different versions of Google
Guava jar files in building Uber jar file (to evict the unwanted ones).
These used to work a year and half ago using Google Dataproc compute
engines (comes with Spark preloaded) and I could create an Uber jar file.

Unfortunately this has become problematic now so tried to use spark-submit
instead as follows:

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.executor.memoryOverhead=3000 \
--class org.apache.spark.repl.Main \
--name "Spark shell on Yarn" "$@"
--driver-class-path /home/hduser/jars/ddhybrid.jar \
--jars /home/hduser/jars/spark-bigquery-latest.jar, \
   /home/hduser/jars/ddhybrid.jar \
--packages com.github.samelamin:spark-bigquery_2.11:0.2.6

Effectively a tailored spark-shell. However, I do not think there is a
mechanism to resolve jar conflicts without building an uber jar file
through SBT?

Cheers



On Tue, 20 Oct 2020 at 16:54, Russell Spitzer 
wrote:

> --jars adds only that jar.
> --packages adds the jar and its dependencies listed in Maven.
>
> On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a scenario that I use in Spark submit as follows:
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>>
>> As you can see the jar files needed are added.
>>
>>
>> This comes back with error message as below
>>
>>
>> Creating model test.weights_MODEL
>>
>> java.lang.NoClassDefFoundError:
>> com/google/api/client/http/HttpRequestInitializer
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>
>>   ... 76 elided
>>
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.api.client.http.HttpRequestInitializer
>>
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>>
>>
>> So there is an issue with finding the class, although the jar file used
>>
>>
>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>
>> has it.
>>
>>
>> Now if *I remove the above jar file and replace it with the same version
>> as a package*, it works!
>>
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>> *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>
>>
>> I have read the write-ups about packages searching the maven
>> libraries etc. Not convinced why using the package should make so much
>> difference between a failure and success. In other words, when to use a
>> package rather than a jar.
>>
>>
>> Any ideas will be appreciated.
>>
>>
>> Thanks
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
Probably because your JAR file requires other JARs which you didn't supply.
If you specify a package, it reads metadata like a pom.xml file to
understand what other dependent JARs also need to be loaded.
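
A quick way to watch that resolution happen, matching what was observed
earlier in this thread:

# run once with --packages; Ivy prints its resolution report and caches the jars
${SPARK_HOME}/bin/spark-submit \
  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 ...
ls ~/.ivy2/jars   # the 37 jars seen earlier: the package plus its pom-declared dependencies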

On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I have a scenario that I use in Spark submit as follows:
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>
> As you can see the jar files needed are added.
>
>
> This comes back with error message as below
>
>
> Creating model test.weights_MODEL
>
> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>
>   ... 76 elided
>
> Caused by: java.lang.ClassNotFoundException:
> com.google.api.client.http.HttpRequestInitializer
>
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
> So there is an issue with finding the class, although the jar file used
>
>
> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>
> has it.
>
>
> Now if *I remove the above jar file and replace it with the same version
> as a package*, it works!
>
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
> *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>
>
> I have read the write-ups about packages searching the maven
> libraries etc. Not convinced why using the package should make so much
> difference between a failure and success. In other words, when to use a
> package rather than a jar.
>
>
> Any ideas will be appreciated.
>
>
> Thanks
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Why spark-submit works with package not with jar

2020-10-20 Thread Russell Spitzer
--jars adds only that jar.
--packages adds the jar and its dependencies listed in Maven.
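
Concretely, with the artifact from this thread:

# only the named jar goes on the classpath; its dependencies are your problem
spark-submit --jars /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar ...

# the jar AND everything its pom declares get resolved and added, via Ivy
spark-submit --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 ...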

On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I have a scenario that I use in Spark submit as follows:
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>
> As you can see the jar files needed are added.
>
>
> This comes back with error message as below
>
>
> Creating model test.weights_MODEL
>
> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>
>   ... 76 elided
>
> Caused by: java.lang.ClassNotFoundException:
> com.google.api.client.http.HttpRequestInitializer
>
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
> So there is an issue with finding the class, although the jar file used
>
>
> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>
> has it.
>
>
> Now if *I remove the above jar file and replace it with the same version
> as a package*, it works!
>
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
> *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>
>
> I have read the write-ups about packages searching the maven
> libraries etc. Not convinced why using the package should make so much
> difference between a failure and success. In other words, when to use a
> package rather than a jar.
>
>
> Any ideas will be appreciated.
>
>
> Thanks
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Hi,

I have a scenario that I use in Spark submit as follows:

spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
/home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
*/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*

As you can see the jar files needed are added.


This comes back with error message as below


Creating model test.weights_MODEL

java.lang.NoClassDefFoundError:
com/google/api/client/http/HttpRequestInitializer

  at
com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)

  at
com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)

  at
com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)

  ... 76 elided

Caused by: java.lang.ClassNotFoundException:
com.google.api.client.http.HttpRequestInitializer

  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)



So there is an issue with finding the class, although the jar file used


/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar

has it.


Now if *I remove the above jar file and replace it with the same version
as a package*, it works!


spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
/home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
*--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*


I have read the write-ups about packages searching the maven libraries etc.
Not convinced why using the package should make so much difference between
a failure and success. In other words, when to use a package rather than a
jar.


Any ideas will be appreciated.


Thanks



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


【The decimal result is incorrectly enlarged by 100 times】

2020-10-20 Thread 王长春
Hi ,
I have come across a problem with the correctness of Spark decimals, and I have
researched it for a few days. This problem is very curious.

My Spark version is 2.3.1.

I have a SQL statement like this:
Create table table_S stored as orc as 
Select a*b*c from table_a
Union all
Select d from table_B
Union all
Select e from table_C

Columns a, b and c are all decimal(38,4).
Column d is also decimal(38,4).
Column e is also decimal(38,4).

The result of this SQL is wrong: it is 100 times greater than the
correct value.

The weird thing is: if I delete the “create table” clause, the result is correct.
And if I change the order of the union, the result is also correct.
E.g

Create table table_S stored as orc as 
Select d from table_B
Union all
Select a*b*c from table_a
Union all
Select e from table_C


Besides, Spark 2.3.2 gives the correct result in this case. But I checked all
the patches in 2.3.2 and cannot find which one solves this problem.


Can anyone give some help? Has anyone encountered the same problem?
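
One workaround that may be worth trying (untested here, and only a sketch):
cast every union branch explicitly, so the plan cannot mis-negotiate the
precision and scale of the multiplied column:

spark-sql -e "
CREATE TABLE table_S STORED AS ORC AS
SELECT CAST(a*b*c AS DECIMAL(38,4)) AS v FROM table_a
UNION ALL
SELECT CAST(d AS DECIMAL(38,4)) FROM table_B
UNION ALL
SELECT CAST(e AS DECIMAL(38,4)) FROM table_C"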



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Organize a Meetup of Apache Spark

2020-10-20 Thread Raúl Martín Saráchaga Díaz
Hi,

I would like to organize an Apache Spark meetup in Lima, Peru. I would love to
share with the whole community.


Regards,


Raúl Saráchaga