Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Jungtaek Lim
Hmm... I read the page again, and it looks like we are in a gray area. The Hadoop community supports JDK 11 starting from Hadoop 3.3, while we haven't yet added Hadoop 3.3 as a dependency. It may not cause a real issue at runtime with Hadoop 3.x, as Spark only uses part of Hadoop (the client layer),

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Jungtaek Lim
Hadoop 2.x doesn't support JDK 11. See Hadoop Java version compatibility with JDK: https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions That said, you'll need to use Spark 3.x with the Hadoop 3.1 profile to make Spark work with JDK 11. On Tue, Mar 16, 2021 at 10:06 PM Sean Owen

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
I am currently just writing in Python on a small batch machine, using the boto3 S3 client with s3.get_paginator('list_objects_v2') and NextContinuationToken. It is pretty fast on a single bucket to list everything with a million keys, but I have not extended it to all the other buckets. Example code
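A minimal sketch of that same continuation-token pagination, written from Scala with the AWS Java SDK rather than boto3 (the bucket name and client setup are placeholders, not from the original mail):

  import com.amazonaws.services.s3.AmazonS3ClientBuilder
  import com.amazonaws.services.s3.model.ListObjectsV2Request
  import scala.collection.JavaConverters._
  import scala.collection.mutable.ArrayBuffer

  val s3 = AmazonS3ClientBuilder.defaultClient()
  val keys = ArrayBuffer[String]()

  // Fetch the first page, then keep following NextContinuationToken while the result is truncated
  var req = new ListObjectsV2Request().withBucketName("my-bucket")
  var res = s3.listObjectsV2(req)
  keys ++= res.getObjectSummaries.asScala.map(_.getKey)
  while (res.isTruncated) {
    req = req.withContinuationToken(res.getNextContinuationToken)
    res = s3.listObjectsV2(req)
    keys ++= res.getObjectSummaries.asScala.map(_.getKey)
  }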

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread brandonge...@gmail.com
One other possibility that might help is using the S3 SDK to generate the list you want, loading groups into DataFrames, and doing unions at the end of the loading/filtering. Something like:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsV2Request
import
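A rough sketch of the rest of that idea (the batch size, bucket name, and the choice of spark.read.text are assumptions, not from the original mail): once the SDK listing has produced the keys, load them in groups and union the results:

  // keys: Seq[String] of object keys from an SDK listing (e.g. the pagination sketch above)
  val paths = keys.map(k => s"s3a://my-bucket/$k")
  val batches = paths.grouped(1000).toSeq                  // batch size is arbitrary
  val dfs = batches.map(batch => spark.read.text(batch: _*))
  val all = dfs.reduce(_ union _)                          // one DataFrame over every listed key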

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
This is very helpful Boris. I will need to re-architect a piece of my code to work with this service but see it as more maintainable/stable long term. I will be developing it out over the course of a few weeks so will let you know how it goes. On Tue, Mar 16, 2021, 2:05 AM Boris Litvak wrote:

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Oh, sure, that was the reason. You can keep using `foreachPartition` and get the partition ID from the `TaskContext`:
scala> import org.apache.spark.TaskContext
import org.apache.spark.TaskContext
scala> myRDD.foreachPartition( e => { println(TaskContext.getPartitionId + ":" +
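For completeness, a runnable version of that snippet (myRDD and the sample data are illustrative):

  import org.apache.spark.TaskContext

  val myRDD = sc.parallelize(Seq("a", "b", "c", "d", "e"))
  myRDD.foreachPartition { e =>
    // e is the iterator over this partition's elements; getPartitionId is a static helper
    println(TaskContext.getPartitionId + ": " + e.mkString(","))
  }

On a local master the partitions are processed by parallel threads, so the printed lines can interleave, which is what misled the original poster; on a cluster the output lands in the executor logs rather than the driver console.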

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
That's a very good idea, thanks for sharing, German! On Tue, Mar 16, 2021 at 7:08 PM German Schiavon wrote: > Hi all, > > I guess you could do something like this too: > > [image: Captura de pantalla 2021-03-16 a las 14.35.46.png] > > On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah < >

Re: How default partitioning in spark is deployed

2021-03-16 Thread German Schiavon
Hi all, I guess you could do something like this too: [image: Captura de pantalla 2021-03-16 a las 14.35.46.png] On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah wrote: > Hi Attila, > > Thanks for looking into this! > > I actually found the issue and it turned out to be that the print >

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Sean Owen
That looks like you didn't compile with Java 11 actually. How did you try to do so? On Tue, Mar 16, 2021, 7:50 AM kaki mahesh raja wrote: > Hi All, > > We have compiled Spark with Java 11 ("11.0.9.1") and when testing the > thrift > server we are seeing that insert query from operator using

Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread kaki mahesh raja
Hi All, We have compiled Spark with Java 11 ("11.0.9.1") and when testing the Thrift server we are seeing that an insert query from the operator using beeline fails with the below error. {"type":"log", "level":"ERROR", "time":"2021-03-15T05:06:09.559Z", "timezone":"UTC", "log":"Uncaught exception in

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Attila, Thanks for looking into this! I actually found the issue, and it turned out that the print statements misled me. The records are indeed stored in different partitions. What happened is that since the foreachPartition method is run in parallel by different threads, they all printed the

Re: Spark streaming giving error for version 2.4

2021-03-16 Thread Attila Zsolt Piros
Hi! I am just guessing here (as Gabor said before, we need more information / logs): but is it possible, Renu, that you just upgraded one single jar? Best Regards, Attila On Tue, Mar 16, 2021 at 11:31 AM Gabor Somogyi wrote: > Well, this is not much. Please provide driver and executor logs... >

Re: Spark streaming giving error for version 2.4

2021-03-16 Thread Gabor Somogyi
Well, this is not much. Please provide driver and executor logs... G On Tue, Mar 16, 2021 at 6:03 AM Renu Yadav wrote: > Hi Team, > > > I have upgraded my spark streaming from 2.2 to 2.4 but getting below error: > > > spark-streaming-kafka_0-10.2.11_2.4.0 > > > scala 2.11 > > > Any Idea? > >
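For what it's worth, if the single-jar theory above is right, the usual fix is to pull all Spark and Kafka artifacts at matching versions. A hedged sbt sketch of what that dependency normally looks like for Spark 2.4.0 on Scala 2.11 (exact versions are for the poster to confirm):

  // build.sbt: spark-streaming-kafka-0-10 must match the Spark and Scala versions in use
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming" % "2.4.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.0"
  )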

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Hi! This is weird. The code of foreachPartition leads to ParallelCollectionRDD

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Mich, Thanks for taking the time to look into my query. Yes, when we increase the number of objects, all partitions start receiving data. I actually tried to understand what happens in my particular case. Thanks! On Tue, Mar 16, 2021 at 2:10 PM Mich Talebzadeh wrote: > Hi, > > Well as

Re: How default partitioning in spark is deployed

2021-03-16 Thread Mich Talebzadeh
Hi, Well, it appears you have 5 entries in your data and 12 cores. The theory is that you run multiple tasks in parallel across multiple cores on a desktop, which applies to your case. The statistics are not there to give a meaningful interpretation of why Spark decided to put all data in one
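As a concrete illustration of that default behaviour (a local[*] sketch with illustrative data, not taken from the thread): with 12 cores the default parallelism is 12, so 5 elements are spread over 12 partitions and most partitions stay empty:

  val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"))
  println(sc.defaultParallelism)   // 12 on a 12-core local[*] master
  println(rdd.getNumPartitions)    // also 12, so at least 7 partitions hold no data
  rdd.glom().collect().zipWithIndex.foreach { case (part, i) =>
    println(s"partition $i -> ${part.mkString(",")}")
  }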

RE: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Boris Litvak
P.S.: 3. If fast updates are required, one way would be capturing S3 events and putting the paths/modification dates/etc. into DynamoDB/your DB of choice. From: Boris Litvak Sent: Tuesday, 16 March 2021 9:03 To: Ben Kaylor ; Alchemist Cc: User Subject: RE: How to make bucket

RE: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Boris Litvak
Ben, I’d explore these approaches: 1. To address your problem, I’d set up an inventory for the S3 bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html. Then you can list the files from the inventory. I have not tried this myself. Note that the inventory update