Hi, Spark Users:
I have a question about how I am using the Spark Dataset API for my case.
Suppose the "ds_old" dataset has 100 records, with 10 unique $"col1" values, and
we run the following pseudo-code:
val ds_new = ds_old
  .repartition(5, $"col1")
  .sortWithinPartitions($"col2")
  .mapPartitions(new MergeFun)

class MergeFun extends MapPartitionsFunction[InputCaseClass, OutputCaseClass] {
  override def call(input: util.Iterator[InputCaseClass]):
      util.Iterator[OutputCaseClass] = {
    // merge logic here
  }
}
I have some questions about the meaning of "partition" in the APIs above; below
is my understanding:
1) repartition(5, $"col1") means distributing all 100 records, keyed by the 10
unique col1 values, across 5 partitions. There is no guarantee of how many (or
which) unique col1 values each of the 5 partitions will receive, but with a
well-balanced hash algorithm, each partition should get close to the average
count (10/5 = 2) when the number of unique values is large.
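To make the idea concrete, here is a plain-Scala sketch (no Spark) of spreading 10 hypothetical unique col1 values over 5 partitions by hash. Spark's actual placement uses its own internal hash of the row, so the assignment below is only illustrative:

```scala
object HashPartitionSketch {
  // Mirrors the non-negative modulo that hash partitioning typically uses;
  // Spark's real hash function differs, so this is only an illustration.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 10).map(i => s"key$i")
    val byPartition = keys.groupBy(k => partitionFor(k, 5))
    byPartition.toSeq.sortBy(_._1).foreach { case (p, ks) =>
      println(s"partition $p -> ${ks.mkString(", ")}")
    }
  }
}
```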
2) sortWithinPartitions($"col2") is one of the parts I want to clear up. What
exactly does "partition" mean in sortWithinPartitions here? I want the data
sorted by "col2" within each unique value of "col1", but the Spark API leans
heavily on the term "partition" in this case. I DO NOT want the 100 records
sorted by col2 within each of the 5 partitions, but rather within each unique
value of "col1". I believe this assumption is right, since we repartition by
"col1" first. Please confirm.
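To pin down the distinction I mean, here is a plain-Scala sketch (no Spark, made-up values) of the two orderings, for one hypothetical partition that received two col1 groups:

```scala
object SortSketch {
  case class Row(col1: String, col2: Int)

  // One hypothetical partition holding rows from two col1 groups.
  val partition = Seq(Row("a", 3), Row("b", 1), Row("a", 2), Row("b", 4))

  // Ordering I do NOT want: the whole partition by col2, groups interleaved.
  val byCol2Only = partition.sortBy(_.col2)

  // Ordering I DO want: col2 order within each col1 group, groups contiguous.
  val perGroupByCol2 = partition.sortBy(r => (r.col1, r.col2))

  def main(args: Array[String]): Unit = {
    println(byCol2Only)
    println(perGroupByCol2)
  }
}
```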
3) mapPartitions(new MergeFun) is another part I want to clear up. I originally
assumed my merge function would be invoked once per unique col1 value (10 times
in this case). But after testing, I found it is in fact called ONCE per
partition, i.e. 5 times. So in this sense, the meaning of "partition" in
mapPartitions IS DIFFERENT from the meaning of "partition" in
sortWithinPartitions, correct? Or is my understanding of "partition" in
sortWithinPartitions also WRONG?
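A plain-Scala sketch (no Spark) of what my test run appears to show: the function is invoked once per physical partition, not once per col1 group. The 2-keys-per-partition split below is hypothetical:

```scala
object InvocationSketch {
  // Returns how many times a per-partition function would be invoked:
  // one call per partition, regardless of how many groups each holds.
  def callCount(partitions: Seq[Seq[String]]): Int = {
    var calls = 0
    partitions.foreach { _ =>
      calls += 1 // one invocation sees ALL groups that landed in this partition
    }
    calls
  }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 10).map(i => s"key$i")
    val partitions = keys.grouped(2).toSeq // pretend 5 partitions, 2 keys each
    println(s"${callCount(partitions)} calls for ${keys.size} groups")
  }
}
```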
In summary, here are my questions:
1) We don't want to use the aggregation API because, in our case, some unique
values of "col1" COULD contain a large number of records, and having the data
sorted in a specified order per col1 value helps the merge logic in our
business case a lot.
2) We don't want to use a window function, as the merge logic is really an
aggregation: there will be only one output record per grouping (col1). So even
though window functions support sorting, they don't fit this case.
3) I understand that the unique value count of "col1" is unpredictable for
Spark. But I wonder whether there is an API that would be invoked per grouping
(per col1 value), instead of per partition (the 5 partitions in this case).
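To pin down the call shape I'm after (not claiming Spark exposes exactly this), here is a plain-Scala sketch: a function invoked once per col1 group, taking that group's rows as an iterator and emitting one merged record. The sum below is only a placeholder for our real merge logic:

```scala
object PerGroupShape {
  case class In(col1: String, col2: Int)
  case class Out(col1: String, merged: Int)

  // Invoked once per col1 group; rows arrive already sorted by col2.
  def mergeGroup(key: String, rows: Iterator[In]): Out =
    Out(key, rows.map(_.col2).sum) // placeholder merge logic

  // Drives mergeGroup once per group, in plain Scala collections.
  def run(data: Seq[In]): Seq[Out] =
    data.groupBy(_.col1).toSeq.sortBy(_._1).map { case (k, rows) =>
      mergeGroup(k, rows.sortBy(_.col2).iterator)
    }

  def main(args: Array[String]): Unit =
    run(Seq(In("a", 1), In("a", 2), In("b", 5))).foreach(println)
}
```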
4) If no such API exists and we have to use MapPartitionsFunction (the Iterator
is much preferred here, as it means we don't need to worry about OOM due to
data skew), my follow-up question is whether Spark guarantees that the data
within each partition arrives in (col1, col2) order, under the API usage shown
above. Or will Spark deliver each partition's data sorted by "col2" for the
first unique value of col1, then sorted by "col2" for the second unique value
of col1, and so on?
One more challenge: if our merge function can rely on the data arriving in this
order, but must produce its output per col1 grouping, in Iterator form, does
Spark have an existing example we can refer to?
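In case it helps the discussion, here is a sketch (plain Scala, not a Spark-provided utility) of the shape such a function could take IF the input iterator is guaranteed to be grouped by col1 and sorted by col2: fold each consecutive run of equal col1 values into one output record, consuming the input lazily so no group is fully materialized:

```scala
object RunMerge {
  case class In(col1: String, col2: Int)
  case class Out(col1: String, merged: Int)

  // Assumes `input` is already grouped by col1 (each group contiguous),
  // each group sorted by col2; emits one Out per run of equal col1.
  def mergeRuns(input: Iterator[In]): Iterator[Out] = {
    val it = input.buffered
    new Iterator[Out] {
      def hasNext: Boolean = it.hasNext
      def next(): Out = {
        val key = it.head.col1
        var acc = 0 // placeholder merge state; replace with real merge logic
        while (it.hasNext && it.head.col1 == key) acc += it.next().col2
        Out(key, acc)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val sorted = Iterator(In("a", 1), In("a", 2), In("b", 5))
    mergeRuns(sorted).foreach(println)
  }
}
```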
Thanks
Yong