[spark-core] docker-image-tool.sh question...

2021-03-09 Thread Muthu Jayakumar
Hello there,

While using docker-image-tool.sh (for Spark 3.1.1), it seems not to accept
the `java_image_tag` build argument; the Docker image defaults to JRE 11.
Here is what I am running from the command line.

$ spark/bin/docker-image-tool.sh -r docker.io/sample-spark -b
java_image_tag=8-jre-slim -t 3.1.1 build

Please advise,
Muthu


Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-09 Thread Pankaj Bhootra
Hi,

Could someone please respond to this?


Thanks
Pankaj Bhootra


On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra,  wrote:

> Hello Team
>
> I am new to Spark and this question may be a possible duplicate of the
> issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
>
> We have a large dataset partitioned by calendar date, and within each date
> partition, we are storing the data as *parquet* files in 128 parts.
>
> We are trying to run aggregation on this dataset for 366 dates at a time
> with Spark SQL on spark version 2.3.0, hence our Spark job is reading
> 366*128=46848 partitions, all of which are parquet files. There is
> currently no *_metadata* or *_common_metadata* file(s) available for this
> dataset.
>
> The problem we are facing is that when we try to run *spark.read.parquet* on
> the above 46848 partitions, our data reads are extremely slow. It takes a
> long time to run even a simple map task (no shuffling) without any
> aggregation or group by.
>
> I read through the above issue and I think I generally understand the
> ideas around the *_common_metadata* file. But that issue was raised for
> Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation
> related to this metadata file so far.
>
> I would like to clarify:
>
> 1. What is the latest best practice for reading a large number of
> parquet files efficiently?
> 2. Does this involve using any additional options with
> spark.read.parquet? How would that work?
> 3. Are there other possible reasons for slow data reads, apart from
> reading metadata for every part? We are basically trying to migrate our
> existing Spark pipeline from CSV files to parquet, but from my hands-on
> experience so far, parquet's read time seems slower than CSV's. This seems
> to contradict the popular opinion that parquet performs better in terms of
> both computation and storage.
>
>
> Thanks
> Pankaj Bhootra
>
>
>
> -- Forwarded message -
> From: Takeshi Yamamuro (Jira) 
> Date: Sat, 6 Mar 2021, 20:02
> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark
> Extremely Slow for Large Number of Files?
> To: 
>
>
>
> [
> https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296528#comment-17296528
> ]
>
> Takeshi Yamamuro commented on SPARK-34648:
> --
>
> Please use the mailing list (user@spark.apache.org) instead. This is not
> the right place to ask questions.
>
> > Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> > 
> >
> > Key: SPARK-34648
> > URL: https://issues.apache.org/jira/browse/SPARK-34648
> > Project: Spark
> >  Issue Type: Question
> >  Components: SQL
> >Affects Versions: 2.3.0
> >Reporter: Pankaj Bhootra
> >Priority: Major
> >
>
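
Regarding points 1 and 2 in the question above, a hedged sketch of read-side
options often tried for many-file Parquet scans; the schema, paths and values
below are illustrative assumptions, not a verified fix:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    // Hedged sketch: standard Spark SQL options sometimes used when scanning tens of
    // thousands of Parquet files; paths and schema below are placeholders.
    val spark = SparkSession.builder()
      .appName("many-parquet-files")
      .config("spark.sql.parquet.mergeSchema", "false")           // skip cross-file schema merging
      .config("spark.sql.hive.metastorePartitionPruning", "true") // prune partitions before listing
      .getOrCreate()

    // Supplying the schema up front avoids schema inference over the file footers.
    val schema = StructType(Seq(
      StructField("event_id", StringType),
      StructField("value", DoubleType)
    ))

    val df = spark.read
      .schema(schema)
      .option("basePath", "/data/events")    // placeholder base path
      .parquet("/data/events/date=2020-*")   // placeholder partition glob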


Re: spark 3.1.1 support hive 1.2

2021-03-09 Thread jiahong li
Thanks, I will try it right now.

Kent Yao wrote on Wed, 10 Mar 2021 at 11:11 AM:

> Hi Li,
> Have you tried `Interacting with Different Versions of Hive Metastore`
> http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
>
>
> Bests,
>
> Kent Yao
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> a spark enthusiast
> kyuubi: a unified multi-tenant JDBC interface for large-scale data
> processing and analytics, built on top of Apache Spark.
> spark-authorizer: A Spark SQL extension which provides SQL Standard
> Authorization for Apache Spark.
> spark-postgres: A library for reading data from and transferring data to
> Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
> spark-func-extras: A library that brings excellent and useful functions
> from various modern database management systems to Apache Spark.
>
>
>
> On 03/10/2021 10:56, jiahong li wrote:
>
> Hi, sorry to bother you. In Spark 3.0.1, hive-1.2 is supported, but in
> Spark 3.1.x the hive-1.2 Maven profile is removed. Does that mean hive-1.2
> is not supported in Spark 3.1.x? How can I support hive-1.2 in Spark 3.1.x,
> or is there a JIRA for this? Can anyone help me?
>
>
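
Not an authoritative recipe, but for reference a minimal sketch of what that
`Interacting with Different Versions of Hive Metastore` page describes,
assuming the Hive 1.2.1 client jars (and matching Hadoop jars) sit at a
placeholder path on the driver:

    import org.apache.spark.sql.SparkSession

    // Hedged sketch: talk to a Hive 1.2.x metastore by pointing Spark SQL at a separate
    // set of Hive client jars. The classpath is a placeholder; "maven" can be used
    // instead to download the jars at runtime.
    val spark = SparkSession.builder()
      .appName("hive-1.2-metastore")
      .config("spark.sql.hive.metastore.version", "1.2.1")
      .config("spark.sql.hive.metastore.jars", "/path/to/hive-1.2.1-jars/*")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()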


Re: spark 3.1.1 support hive 1.2

2021-03-09 Thread Kent Yao
Hi Li,

Have you tried `Interacting with Different Versions of Hive Metastore`?
http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore

Bests,

Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi: a unified multi-tenant JDBC interface for large-scale data
processing and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard
Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to
Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions
from various modern database management systems to Apache Spark.


On 03/10/2021 10:56, jiahong li wrote:

Hi, sorry to bother you. In Spark 3.0.1, hive-1.2 is supported, but in Spark
3.1.x the hive-1.2 Maven profile is removed. Does that mean hive-1.2 is not
supported in Spark 3.1.x? How can I support hive-1.2 in Spark 3.1.x, or is
there a JIRA for this? Can anyone help me?


spark 3.1.1 support hive 1.2

2021-03-09 Thread jiahong li
Hi, sorry to bother you. In Spark 3.0.1, hive-1.2 is supported, but in Spark
3.1.x the hive-1.2 Maven profile is removed. Does that mean hive-1.2 is not
supported in Spark 3.1.x? How can I support hive-1.2 in Spark 3.1.x, or is
there a JIRA for this? Can anyone help me?


Speed up Spark writes to Google Cloud storage

2021-03-09 Thread SRK
hi,

Our Spark writes to GCS are slow. The reason I see is that a staging
directory is used for the initial data generation, followed by copying the
data to the actual directory in GCS. Below are a few configs and the code.
Any suggestions on how to speed this up would be great.

// Partition-overwrite and committer settings currently in use:
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sparkSession.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
sparkSession.conf.set("spark.hadoop.mapreduce.use.directfileoutputcommitter", "true")
sparkSession.conf.set(
  "spark.hadoop.mapred.output.committer.class",
  "org.apache.hadoop.mapred.DirectFileOutputCommitter"
)

sparkSession.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// Note: spark.speculation is a Spark property, not a Hadoop one, so setting it on
// hadoopConfiguration has no effect; it belongs in the Spark conf instead.
sparkSession.sparkContext.hadoopConfiguration
  .set("spark.speculation", "false")

// Write partitioned, gzip-compressed CSV to GCS.
snapshotInGCS.write
  .option("header", "true")
  .option("emptyValue", "")
  .option("delimiter", "^")
  .mode(SaveMode.Overwrite)
  .format("csv")
  .partitionBy("date", "id")
  .option("compression", "gzip")
  .save(s"gs://${bucketName}/${folderName}")
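
One commonly suggested tweak, sketched below under assumptions (the target of
64 partitions is a placeholder to tune), is to cut the number of output files
so the commit/rename phase on GCS has fewer objects to move:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Hedged sketch: fewer output files mean fewer objects for the committer to
    // rename/copy on GCS. `snapshotInGCS`, `bucketName` and `folderName` refer to
    // the snippet above; 64 partitions is an assumption, not a recommendation.
    val toWrite = snapshotInGCS.repartition(64, col("date"), col("id"))

    toWrite.write
      .option("header", "true")
      .option("emptyValue", "")
      .option("delimiter", "^")
      .option("compression", "gzip")
      .mode(SaveMode.Overwrite)
      .partitionBy("date", "id")
      .csv(s"gs://${bucketName}/${folderName}")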



Thank you,
SK



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Mich Talebzadeh
Thanks Sean,

I am using PySpark. There were some reports of foreach usage with local
mode back on the 3rd of March. For example, see

"Spark structured streaming seems to work on local mode only"

I believe the thread owner was reporting on the *foreach* case, not foreachBatch.

cheers


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 9 Mar 2021 at 22:51, Sean Owen  wrote:

> That should not be the case. See
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
> Maybe you are calling .foreach on some Scala object inadvertently.
>
> On Tue, Mar 9, 2021 at 4:41 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> When I use *foreachBatch* in Spark Structured Streaming, yarn mode works
>> fine.
>>
>> When one switches to *foreach* mode (row-by-row processing), it
>> effectively runs in local mode on a single JVM. It seems to crash when
>> running in distributed mode. That is my experience.
>>
>> Can someone else please verify this independently?
>>
>> Cheers
>>
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Sean Owen
That should not be the case. See
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
Maybe you are calling .foreach on some Scala object inadvertently.
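
For reference, a minimal Scala sketch of the two sinks; the rate source, sink
bodies and output path are placeholder assumptions:

    import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}

    // Hedged sketch: both sinks are meant to run against the cluster, not a single JVM.
    val spark = SparkSession.builder().appName("foreach-vs-foreachBatch").getOrCreate()
    val stream = spark.readStream.format("rate").load() // built-in test source

    // foreach: the ForeachWriter is serialized and invoked per partition on the executors.
    val q1 = stream.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, epochId: Long): Boolean = true
        def process(row: Row): Unit = println(row)     // placeholder per-row sink
        def close(errorOrNull: Throwable): Unit = ()
      })
      .start()

    // foreachBatch: the function runs on the driver, but the DataFrame operations
    // inside it still execute across the cluster.
    val q2 = stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write.mode("append").parquet(s"/tmp/out/batch=$batchId") // placeholder path
      }
      .start()

    spark.streams.awaitAnyTermination()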

On Tue, Mar 9, 2021 at 4:41 PM Mich Talebzadeh 
wrote:

> Hi,
>
> When I use *foreachBatch* in Spark Structured Streaming, yarn mode works
> fine.
>
> When one switches to *foreach* mode (row-by-row processing), it
> effectively runs in local mode on a single JVM. It seems to crash when
> running in distributed mode. That is my experience.
>
> Can someone else please verify this independently?
>
> Cheers
>
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Mich Talebzadeh
Hi,

When I use *foreachBatch* in Spark Structured Streaming, yarn mode works
fine.

When one switches to *foreach* mode (row-by-row processing), it
effectively runs in local mode on a single JVM. It seems to crash when
running in distributed mode. That is my experience.

Can someone else please verify this independently?

Cheers




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Not sure if Kinesis has such flexibility. What other possibilities are there
at the transformation level?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Any example for this, please?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Sean Owen
You can also group by the key in the transformation on each batch. But yes
that's faster/easier if it's already partitioned that way.
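
For illustration, a minimal sketch of that per-batch grouping, assuming the
Kinesis DStream has already been mapped to (eventId, payload) pairs; the
partition count and processing body are placeholders:

    import org.apache.spark.streaming.dstream.DStream

    // Hedged sketch: group records by eventId within each micro-batch, regardless of
    // how Kinesis sharded them. `events` is assumed to already be a pair DStream.
    def processPerBatch(events: DStream[(String, String)]): Unit = {
      events
        .groupByKey(numPartitions = 32) // all records for an eventId land in one task per batch
        .foreachRDD { rdd =>
          rdd.foreachPartition { iter =>
            iter.foreach { case (eventId, payloads) =>
              println(s"$eventId -> ${payloads.size} events") // placeholder per-group processing
            }
          }
        }
    }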

On Tue, Mar 9, 2021 at 7:30 AM Ali Gouta  wrote:

> I don't know Kinesis, but it looks like it works like Kafka. Your producer
> should implement a partitioner that makes it possible to send your data
> with the same key to the same partition. That way, each task in your Spark
> Streaming app will load data from the same partition in the same executor.
> I think this is the simplest way to achieve what you want to do.
>
> Best regards,
> Ali Gouta.
>
> On Tue, Mar 9, 2021 at 11:30 AM forece85  wrote:
>
>> We are doing batch processing using Spark Streaming with Kinesis, with a
>> batch size of 5 minutes. We want to send all events with the same eventId
>> to the same executor within a batch so that we can run grouping operations
>> over multiple events keyed by eventId. No previous or future batch data is
>> involved; only keyed operations within the current batch are needed.
>>
>> Please help me understand how to achieve this.
>>
>> Thanks.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Ali Gouta
I don't know Kinesis, but it looks like it works like Kafka. Your producer
should implement a partitioner that makes it possible to send your data
with the same key to the same partition. That way, each task in your Spark
Streaming app will load data from the same partition in the same executor.
I think this is the simplest way to achieve what you want to do.

Best regards,
Ali Gouta.
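
For illustration only, a rough sketch of that producer-side idea with the AWS
SDK for Java v1 (stream name and client setup are assumptions; the key point
is using eventId as the partition key):

    import java.nio.ByteBuffer
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.PutRecordRequest

    // Hedged sketch: records sharing a partition key land on the same Kinesis shard,
    // so keying by eventId keeps an event's records together. Names are placeholders.
    val kinesis = AmazonKinesisClientBuilder.defaultClient()

    def send(eventId: String, payload: Array[Byte]): Unit = {
      val request = new PutRecordRequest()
        .withStreamName("events-stream")   // placeholder stream name
        .withPartitionKey(eventId)         // same key => same shard
        .withData(ByteBuffer.wrap(payload))
      kinesis.putRecord(request)
    }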

On Tue, Mar 9, 2021 at 11:30 AM forece85  wrote:

> We are doing batch processing using Spark Streaming with Kinesis, with a
> batch size of 5 minutes. We want to send all events with the same eventId
> to the same executor within a batch so that we can run grouping operations
> over multiple events keyed by eventId. No previous or future batch data is
> involved; only keyed operations within the current batch are needed.
>
> Please help me understand how to achieve this.
>
> Thanks.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
We are doing batch processing using Spark Streaming with Kinesis, with a batch
size of 5 minutes. We want to send all events with the same eventId to the
same executor within a batch so that we can run grouping operations over
multiple events keyed by eventId. No previous or future batch data is
involved; only keyed operations within the current batch are needed.

Please help me understand how to achieve this.

Thanks.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


