[spark-core] docker-image-tool.sh question...

2021-03-09 Thread Muthu Jayakumar
Hello there, While using docker-image-tool.sh (for Spark 3.1.1), it seems not to accept the `java_image_tag` property; the Docker image defaults to JRE 11. Here is what I am running from the command line: $ spark/bin/docker-image-tool.sh -r docker.io/sample-spark -b java_image_tag=8-jre-slim -t 3.1.1
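One thing that stands out in the command above: docker-image-tool.sh expects a `build` (or `push`) subcommand at the end, and it may be that, rather than the `-b` flag itself, which is the issue here. Under that assumption, a build that overrides the base Java image would look something like this (repo and tag copied from the question):

    $ spark/bin/docker-image-tool.sh -r docker.io/sample-spark -t 3.1.1 -b java_image_tag=8-jre-slim build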

Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-09 Thread Pankaj Bhootra
Hi, Could someone please respond to this? Thanks Pankaj Bhootra On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, wrote: > Hello Team > > I am new to Spark and this question may be a duplicate of the > issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 > > We have a
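For context: reads over very large numbers of Parquet files are often dominated by file listing and footer reads rather than the scan itself. A minimal PySpark sketch of settings that commonly matter, assuming `spark` is an active SparkSession (the values are illustrative, not tuned recommendations):

    # Parallelize partition/file discovery sooner and with more tasks.
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "200")
    # Schema merging reads every footer; keep it off unless it is really needed.
    df = spark.read.option("mergeSchema", "false").parquet("/data/events/")  # hypothetical path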

Re: spark 3.1.1 support hive 1.2

2021-03-09 Thread jiahong li
Thanks, I will try it right now. Kent Yao wrote on Wed, 10 Mar 2021 at 11:11: > Hi Li, > Have you tried `Interacting with Different Versions of Hive Metastore` > http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore > > > Bests, > > *Kent Yao

Re: spark 3.1.1 support hive 1.2

2021-03-09 Thread Kent Yao
Hi Li, Have you tried `Interacting with Different Versions of Hive Metastore` http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore  Bests,
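For reference, that doc page comes down to two settings; a minimal PySpark sketch of pointing Spark 3.1.x at a Hive 1.2 metastore (the version string and jar source below are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-1.2-metastore")
             # Which metastore version Spark should talk to...
             .config("spark.sql.hive.metastore.version", "1.2.2")
             # ...and where to get the matching Hive client jars from
             # ("maven" downloads them; a classpath of local jars also works).
             .config("spark.sql.hive.metastore.jars", "maven")
             .enableHiveSupport()
             .getOrCreate())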

spark 3.1.1 support hive 1.2

2021-03-09 Thread jiahong li
Hi, sorry to bother you. In Spark 3.0.1, hive-1.2 is supported, but in Spark 3.1.x the hive-1.2 Maven profile has been removed. Does that mean hive-1.2 is not supported in Spark 3.1.x? How can I use hive-1.2 with Spark 3.1.x? Is there a JIRA tracking this? Can anyone help me?

Speed up Spark writes to Google Cloud storage

2021-03-09 Thread SRK
hi, Our Spark writes to GCS are slow. The reason, as far as I can see, is that a staging directory is used for the initial data generation, followed by copying the data to the actual directory in GCS. Below are a few configs and the code. Any suggestions on how to speed this up would be great.
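If the cost really is the copy from the staging directory, one setting that often comes up for object-store writes (a sketch, not a verified fix for this workload) is the v2 FileOutputCommitter algorithm, which commits task output directly instead of renaming it at job commit:

    from pyspark.sql import SparkSession

    # v2 skips the final job-commit rename; it is faster on object stores but
    # gives weaker guarantees if a job fails mid-commit.
    spark = (SparkSession.builder
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())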

Re: Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Mich Talebzadeh
Thanks Sean, I am using PySpark. There seem to be some reports on *foreach* usage with local mode back on 3rd March. For example, see "Spark structured streaming seems to work on local mode only". I believe the thread owner was reporting on the *foreach* case, not foreachBatch. cheers

Re: Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Sean Owen
That should not be the case. See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch Maybe you are calling .foreach on some Scala object inadvertently. On Tue, Mar 9, 2021 at 4:41 PM Mich Talebzadeh wrote: > Hi, > > When I use
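For reference, the foreach path in that guide takes a row-level function (or a writer object) and does run on the executors; a minimal PySpark sketch using a placeholder rate source:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("foreach-demo").getOrCreate()
    # Built-in rate source, used here only as a stand-in for a real stream.
    df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    def process_row(row):
        # Row-by-row sink logic; this executes on the executors, not the driver.
        print(row)

    df.writeStream.foreach(process_row).start().awaitTermination()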

Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Mich Talebzadeh
Hi, When I use *foreachBatch* in Spark structured streaming, yarn mode works fine. When one switches to *foreach* mode (row-by-row processing), it effectively runs in local mode on a single JVM. It seems to crash when running in distributed mode. That is my experience. Can someone else
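For contrast with the *foreach* sketch above, a minimal *foreachBatch* sketch (same placeholder rate source); here each micro-batch is handed to the function as a regular DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("foreachBatch-demo").getOrCreate()
    df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    def process_batch(batch_df, batch_id):
        # batch_df is an ordinary DataFrame, so any batch action or writer works.
        print(batch_id, batch_df.count())

    df.writeStream.foreachBatch(process_batch).start().awaitTermination()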

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Not sure if Kinesis has such flexibility. What other possibilities are there at the transformations level?

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Any example for this, please?

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Sean Owen
You can also group by the key in the transformation on each batch. But yes, that's faster/easier if it's already partitioned that way. On Tue, Mar 9, 2021 at 7:30 AM Ali Gouta wrote: > Do not know Kinesis, but it looks like it works like Kafka. Your producer > should implement a partitioner that
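A minimal sketch of the group-by-key option on each batch, assuming `stream` is a DStream of dicts carrying an 'eventId' field (all names here are illustrative):

    def handle_event_group(event_id, events):
        # Placeholder for the per-eventId grouping logic.
        pass

    pairs = stream.map(lambda event: (event["eventId"], event))
    # groupByKey shuffles all events sharing a key into the same partition,
    # so each key's group is processed by a single task per batch.
    pairs.groupByKey().foreachRDD(
        lambda rdd: rdd.foreach(lambda kv: handle_event_group(kv[0], list(kv[1]))))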

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Ali Gouta
Do not know Kinesis, but it looks like it works like Kafka. Your producer should implement a partitioner that makes it possible to send your data with the same key to the same partition. Then, each task in your Spark Streaming app will load data from the same partition in the same executor. I
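On the producer side, the Kinesis put APIs do take an explicit partition key, much like a keyed Kafka producer. A boto3 sketch (stream name and record shape are assumptions):

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    event = {"eventId": "e-123", "payload": "..."}  # illustrative record
    # Records that share a PartitionKey always land on the same shard.
    kinesis.put_record(
        StreamName="my-stream",  # hypothetical stream name
        Data=json.dumps(event),
        PartitionKey=event["eventId"],
    )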

Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
We are doing batch processing using Spark Streaming with Kinesis, with a batch size of 5 minutes. We want to send all events with the same eventId to the same executor for a batch, so that we can do multiple grouping operations over the events for each eventId. No data from previous or future batches is involved.
