Re: Spark Structured Streaming org.apache.spark.sql.functions.input_file_name Intermittently Missing FileName

2021-10-12 Thread Alchemist
InputReportAndFileName fileName | InputReportAndFileName fileName file:///Users/abc/Desktop/test/Streaming/b | InputReportAndFileName file

Re: Spark Structured Streaming org.apache.spark.sql.functions.input_file_name Intermittently Missing FileName

2021-10-12 Thread Alchemist
      def getInputFilePath: UTF8String = inputBlock.get().get().filePath  Can anyone help me understand what it means for the file name to be unknown, hence the above API returning a blank filename below? On Monday, October 11, 2021, 08:43:42 PM PDT, Alchemist wrote: Hello all, I am trying to extract file

Spark Structured Streaming org.apache.spark.sql.functions.input_file_name Intermittently Missing FileName

2021-10-11 Thread Alchemist
Hello all, I am trying to extract the file name like the following, but intermittently we are getting an empty file name. Step 1: Get Schema: StructType jsonSchema = sparkSession.read() .option("multiLine", true) .json("src/main/resources/sample.json") .schema(); Step 2: Get Input Dataset: Dataset inputDS =
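A minimal sketch of the pattern under discussion, assuming Spark SQL is on the classpath (the JSON path is the one from the message). input_file_name() is resolved from task-local metadata while the file is being read, so one common mitigation is to materialize it as a column immediately at read time, before any shuffle or repartition that can drop the per-task file information:

```java
// Sketch, not the poster's actual code. Assumes a local SparkSession;
// the path comes from the message above.
import static org.apache.spark.sql.functions.input_file_name;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CaptureFileName {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("capture-file-name").getOrCreate();
        Dataset<Row> inputDS = spark.read()
                .option("multiLine", true)
                .json("src/main/resources/sample.json")
                // Capture the name here, while the read tasks still know it.
                .withColumn("source_file", input_file_name());
        inputDS.show();
        spark.stop();
    }
}
```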

How to process S3 data in Scalable Manner Using Spark API (wholeTextFile VERY SLOW and NOT scalable)

2021-10-02 Thread Alchemist
Issue: We are using the wholeTextFile() API to read files from S3, but this API is extremely slow for the reasons mentioned below. The question is how to fix this issue. Here is our analysis so far: we are using the Spark wholeTextFile API to read S3 files. The wholeTextFile API works in two steps.
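One alternative worth noting, as a hedged sketch (Spark on the classpath assumed; the bucket and prefix are illustrative): wholeTextFiles() materializes each file as a single (path, content) record, which is painful for many small S3 objects. A line-oriented read scales better, and the originating file can still be recovered via input_file_name():

```java
// Sketch only. "s3a://my-bucket/my-prefix/" is a hypothetical location.
import static org.apache.spark.sql.functions.input_file_name;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ScalableS3Read {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("s3-read").getOrCreate();
        Dataset<Row> lines = spark.read()
                .text("s3a://my-bucket/my-prefix/")
                // Keep the per-line provenance that wholeTextFiles() gave via its key.
                .withColumn("source_file", input_file_name());
        System.out.println(lines.count());
        spark.stop();
    }
}
```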

How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Alchemist
How to optimize listing S3 files using wholeTextFile(): We are using wholeTextFile to read data from S3. As per my understanding, wholeTextFile first lists the files under the given path. Since we are using S3 as the input source, listing files in a bucket is single-threaded; the S3 API for listing the
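The single-threaded listing bottleneck described above can be worked around by splitting the key space into prefixes (for example by leading character or date partition) and listing each shard concurrently. A self-contained sketch of the idea, using local directories as a stand-in for S3 prefixes (with S3 you would issue one ListObjectsV2 request stream per prefix instead):

```java
// Sketch: one listing task per prefix, run concurrently.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListing {
    // List every entry under each prefix, listing the prefixes in parallel.
    public static List<String> listAll(List<Path> prefixes) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, prefixes.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Path prefix : prefixes) {
                futures.add(pool.submit(() -> {
                    try (Stream<Path> entries = Files.list(prefix)) {
                        return entries.map(Path::toString).collect(Collectors.toList());
                    }
                }));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get());
            }
            return all;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The same fan-out shape applies whether the per-prefix call is Files.list or an S3 list request; the win comes from overlapping the request latency, not from the worker count exceeding the number of prefixes.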

Spark stuck at removing broadcast variable

2020-04-18 Thread Alchemist
I am running a simple Spark structured streaming application that pulls data from a Kafka topic. I have a Kafka topic with nearly 1000 partitions. I am running this app on a 6-node EMR cluster with 4 cores and 16 GB RAM. I observed that Spark is trying to pull data from all 1024 Kafka
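For context on why this happens: the Kafka source creates at least one Spark task per Kafka partition per micro-batch, so 1024 partitions on roughly 24 cores means many task waves per batch. A hedged sketch of the relevant knob (spark-sql-kafka on the classpath assumed; broker and topic names are illustrative), which bounds how much data each micro-batch pulls rather than the task count:

```java
// Sketch fragment; "spark" is an existing SparkSession, names are hypothetical.
Dataset<Row> stream = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "my-topic")
        .option("maxOffsetsPerTrigger", 100000)  // cap records pulled per micro-batch
        .load();
```

Reducing the task count itself would require fewer topic partitions or a coalesce after the read; maxOffsetsPerTrigger only limits per-batch volume.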

How to repartition Spark DStream Kafka ConsumerRecord RDD.

2018-09-28 Thread Alchemist
 How to repartition a Spark DStream Kafka ConsumerRecord RDD: we are getting unevenly sized Kafka partitions, and we want to repartition the input RDD based on some logic. But when I try to apply the repartition I am getting "object not serializable (class:
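The likely cause of that error: repartition() shuffles records, a shuffle serializes them, and Kafka's ConsumerRecord is not Serializable. The usual workaround is to map each record down to its serializable fields before repartitioning. A hedged sketch (spark-streaming-kafka-0-10 assumed; "directStream" stands for the stream returned by KafkaUtils.createDirectStream, and the partition count is hypothetical):

```java
// Sketch fragment. Extract plain key/value pairs first, then shuffle those.
import org.apache.spark.streaming.api.java.JavaDStream;
import scala.Tuple2;

JavaDStream<Tuple2<String, String>> pairs = directStream
        .map(record -> new Tuple2<>(record.key(), record.value()));
JavaDStream<Tuple2<String, String>> balanced = pairs.repartition(48); // hypothetical target
```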

Writing data to HBase using Spark failing with NPE

2018-05-23 Thread Alchemist
I am using Spark to write data to HBase. I can read data just fine, but writes are failing with the following exception. I found a similar issue that got resolved by adding *site.xml and HBase JARs, but it is not working for me.       JavaPairRDD  tablePuts =
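Two common causes of NPEs on the write path are hbase-site.xml missing from the executor classpath and HBase connections created on the driver, which do not survive serialization to executors. A hedged sketch of the per-partition connection pattern (hbase-client assumed; the table name is hypothetical, and "tablePuts" is the pair RDD of Puts from the message):

```java
// Sketch fragment. Build the connection inside foreachPartition so it is
// created on the executor, never serialized from the driver.
tablePuts.foreachPartition(rows -> {
    org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
        while (rows.hasNext()) {
            table.put(rows.next()._2()); // the Put built upstream
        }
    }
});
```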

Spark UNEVENLY distributing data

2018-05-19 Thread Alchemist
I am trying to parallelize a simple Spark program that processes HBase data in parallel. // Get HBase RDD JavaPairRDD hBaseRDD = jsc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class); long
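A note on why this can look uneven: newAPIHadoopRDD yields one Spark partition per HBase region, so region skew becomes task skew. A hedged sketch of one mitigation (hbase-client and Spark assumed; "jsc" and "conf" as in the snippet). Writable types like ImmutableBytesWritable are not Java-serializable, so the rows are mapped to plain strings before the shuffle:

```java
// Sketch fragment: convert to serializable values, then spread evenly.
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, String> rows = hBaseRDD
        .mapToPair(t -> new Tuple2<>(
                Bytes.toString(t._1().get()),  // row key as String
                t._2().toString()))            // Result rendered as String (illustrative)
        .repartition(jsc.defaultParallelism()); // one shuffle for an even spread
```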

Malformed URL Exception when connecting to Phoenix from Spark

2018-05-09 Thread Alchemist
Code: JavaSparkContext sc = new JavaSparkContext(sparkConf); SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc); Map map = new HashMap(); map.put("zkUrl", args[2]); map.put("table", args[1]); map.put("driver",
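A hedged note on a commonly reported culprit for MalformedURLException here: the zkUrl value passed in args[2]. Phoenix generally expects a ZooKeeper quorum of the form host list, client port, and root znode. A sketch of plausible values (hostnames and table name are illustrative, not from the message):

```
zkUrl  = zk-host1,zk-host2,zk-host3:2181:/hbase
table  = MY_TABLE
driver = org.apache.phoenix.jdbc.PhoenixDriver
```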