Image Grep

2019-05-06 Thread swastik mittal
My Spark driver program reads multiple images from HDFS and searches for a particular image by image name. If it finds the image, it converts the received byte array of the image back to its original form. But the image I get after conversion is corrupted. I am using ImageSchema to
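
If the byte array is copied into a BufferedImage whose channel order or row layout does not match, the result looks corrupted. A minimal sketch of one way to rebuild a 3-channel image from an ImageSchema row (the helper name and output handling are illustrative, not from the original post):

    import java.awt.image.{BufferedImage, DataBufferByte}
    import org.apache.spark.ml.image.ImageSchema
    import org.apache.spark.sql.Row

    // Illustrative helper (assumption, not from the post): rebuild a 3-channel image from
    // the nested "image" struct, e.g. df.select("image").head.getStruct(0).
    // Spark stores the pixel bytes BGR-interleaved, row-major, which matches the backing
    // buffer layout of TYPE_3BYTE_BGR, so a straight copy keeps the channels aligned.
    def toBufferedImage(imageRow: Row): BufferedImage = {
      val width  = ImageSchema.getWidth(imageRow)
      val height = ImageSchema.getHeight(imageRow)
      val data   = ImageSchema.getData(imageRow)
      val img    = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR)
      val buffer = img.getRaster.getDataBuffer.asInstanceOf[DataBufferByte].getData
      System.arraycopy(data, 0, buffer, 0, data.length)
      img
    }

Writing the result with javax.imageio.ImageIO.write(toBufferedImage(row), "png", new java.io.File("out.png")) would then let you inspect whether the bytes came back intact.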

batch processing in spark

2019-05-05 Thread swastik mittal
From my experience with Spark, when working on data stored in HDFS, Spark reads data in the form of records and computes on each record as soon as it reads it. I have multiple images as my data on HDFS, where each image is a record. I want Spark to read multiple records before doing any
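
A minimal sketch of one way to force batching inside each partition, assuming the records are available as an RDD of (name, bytes) pairs (the variable names and the batch size of 32 are illustrative):

    // Assumption: imagesRdd is an RDD[(String, Array[Byte])] of (imageName, imageBytes) records.
    val batched = imagesRdd.mapPartitions { iter =>
      iter.grouped(32).flatMap { batch =>
        // up to 32 records are pulled into memory here before any per-record work starts
        batch.map { case (name, bytes) => (name, bytes.length) }
      }
    }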

Not able to convert Image binary to an image

2019-04-19 Thread swastik mittal
Hi, I am working with Apache Spark 2.3.2, implementing an image grep application using Scala 2.11. I am reading images from HDFS using the ImageSchema package. The series of steps I run is: 1. import org.apache.spark.ml.image.ImageSchema 2. val df = ImageSchema.readImages("hdfs://filepath/*") // all
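
A sketch of how the grep part might continue from step 2, with "target.jpg" standing in for the image name being searched for:

    import org.apache.spark.ml.image.ImageSchema

    val df = ImageSchema.readImages("hdfs://filepath/*")            // step 2 from the post
    // "target.jpg" is a placeholder; image.origin holds the source path of each image
    val hit = df.filter(df("image.origin").endsWith("target.jpg"))
    hit.select("image.origin", "image.width", "image.height").show(false)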

How does Spark operate internally for an individual task?

2019-03-14 Thread swastik mittal
I am running a grep application on Spark 2.3.4 with Scala 2.11. I have an input text file of 813 MB stored on a remote source (not part of the Spark infrastructure) using HDFS. My application just reads the text file line by line from the HDFS server and filters for a given keyword in each line and
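
A minimal sketch of the grep job described, with the HDFS path and keyword as placeholders:

    // Placeholders: adjust the HDFS URI and keyword to the actual job; sc is the SparkContext.
    val lines   = sc.textFile("hdfs://namenode:8020/input/textfile.txt")
    val keyword = "ERROR"
    val matches = lines.filter(_.contains(keyword))   // each task scans only the lines of its own split
    println(s"lines containing '$keyword': ${matches.count()}")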

Re: Yarn job is Stuck

2019-03-14 Thread swastik mittal
It is possible that the ApplicationMaster is not getting started. Try increasing the memory limit for the application master in yarn-site.xml, or in the capacity scheduler configuration if you have it configured.

Re: Build spark source code with scala 2.11

2019-03-12 Thread swastik mittal
Then is Spark's MLlib compatible with Scala 2.12? Or can I change the Spark version from 3.0 to 2.3 or 2.4 in the local spark/master?

Build spark source code with scala 2.11

2019-03-12 Thread swastik mittal
I am trying to build Spark using build/sbt package, after changing the Scala version to 2.11 in pom.xml, because my application's jar files use Scala 2.11. But building the Spark code gives an error in sql saying "A method with a varargs annotation produces a forwarder method with the same

Milliseconds in timestamp

2019-03-02 Thread swastik mittal
Is there a way I can log the milliseconds as well, in addition to HH:mm:ss, for every timestamp Spark logs?

updateBytesRead()

2019-03-01 Thread swastik mittal
Hi, in the Spark source code, HadoopRDD.scala (in rdd), Spark updates the total-bytes-read information after every 1000 records. Displaying the bytes read alongside the update function, it shows 65536. Even if I change the code to update the bytes read after every record, it still shows 65536 multiple

Detect data from textFile RDD

2019-02-22 Thread swastik mittal
Hey, I am working with the Spark source code. I am printing logs within the code to understand how HadoopRDD works. I want to print a timestamp when the executor first reads the textFile RDD (the input source (URL) is in HDFS). I tried to print some logs in Executor.scala but they do not display on the
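
If modifying the Spark source is optional, a similar first-read timestamp can be captured from the application itself; a sketch with the path as a placeholder (when running on a cluster, executor-side println output lands in each executor's stderr log, not on the driver console):

    // Placeholder path; prints a timestamp the first time each partition's iterator yields a record.
    val rdd = sc.textFile("hdfs://namenode:8020/input/textfile.txt")
    val timed = rdd.mapPartitionsWithIndex { (idx, iter) =>
      var first = true
      iter.map { line =>
        if (first) {
          println(s"partition $idx: first record read at ${System.currentTimeMillis} ms")
          first = false
        }
        line
      }
    }
    timed.count()   // trigger the job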

Remote Data Read Time

2019-01-10 Thread swastik mittal
I was working with the custom Spark listener library. There, I am not able to figure out a way to break into the details of a task. I only have a listener which runs on task start, but I want to calculate the time my executor took to read input data from the remote data source for that task, but as Spark
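
For reference, the task-end event does carry per-task input metrics (bytes and records read), though not a separate remote-read time; a sketch of a listener that reports them, approximating read cost with the overall task duration (an assumption, not an exact read time):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class InputReadListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics          // can be null for failed tasks
        if (metrics != null) {
          val in = metrics.inputMetrics
          println(s"task ${taskEnd.taskInfo.taskId}: ${in.bytesRead} bytes, " +
            s"${in.recordsRead} records, task duration ${taskEnd.taskInfo.duration} ms")
        }
      }
    }

    sc.addSparkListener(new InputReadListener)     // register before running the job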

Re: Read Time from a remote data source

2018-12-19 Thread swastik mittal
I am running a model where the workers should not have the data stored on them; they are only for execution purposes. The other cluster (it's just a single node) which I am receiving data from is just acting as a file server, for which I could have used some other mechanism like NFS or FTP. So I went with

Read Time from a remote data source

2018-12-18 Thread swastik mittal
Hi, I am new to Spark. I am running an HDFS file system on a remote cluster whereas my Spark workers are on another cluster. When my textFile RDD gets executed, do the Spark workers read from the file according to HDFS partitions, task by task, or do they read it once when the BlockManager is set up after
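
A quick way to see the split-per-task behaviour being asked about (the path is a placeholder): each partition maps to one HDFS input split, and each task reads only its own split lazily when the stage runs, rather than the whole file being pre-loaded into the block manager.

    // Placeholder path; one task per partition, one partition per HDFS input split.
    val rdd = sc.textFile("hdfs://namenode:8020/data/textfile.txt")
    println(s"partitions (tasks in the read stage): ${rdd.getNumPartitions}")
    println(s"lines: ${rdd.count()}")   // nothing is read from HDFS until this action runs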