Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
> ... compared to Dataset's mapPartitions / map function? (Enrico) >> On 12.07.2022 at 22:13, Muthu Jayakumar wrote: >> Hello Enrico, thanks for the reply. I found that I would have to use the `mapPartitions` API of RDD to

Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
(quoted sample rows of the expected output, each pairing an entity id with the content of its file and the matching other_useful_id) |id-01f7pqqbxejt3ef4ap9qcs78m5|content of gs://bucket1/path/to/id-01g4he5cbqmdv7dnx46sebs0gt/file_result.json|id-2-01g4he5cbqmdv7dnx46sebs0gt| |id-01f7pqqbynh895ptpjjfxvk6dc|content of gs://bucket1/path/to/id-01g4he5cbx1kwhgvdme1s560dw/file_re

reading each JSON file from dataframe...

2022-07-10 Thread Muthu Jayakumar
Hello there, I have a dataframe with the following... |entity_id|file_path|other_useful_id|
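A minimal sketch of the approach discussed in the replies above (reading each row's file_path content inside mapPartitions); the source table name is an assumption and error handling is omitted:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("read-json-per-row").getOrCreate()
    import spark.implicits._

    case class FileRow(entity_id: String, file_path: String, other_useful_id: String)

    val withContent = spark.table("file_index").as[FileRow]    // hypothetical source of the three columns
      .mapPartitions { rows =>
        val conf = new Configuration()                         // created per partition, so nothing non-serializable is captured
        rows.map { r =>
          val path = new Path(r.file_path)
          val in   = path.getFileSystem(conf).open(path)
          val json = try scala.io.Source.fromInputStream(in).mkString finally in.close()
          (r.entity_id, json, r.other_useful_id)
        }
      }
      .toDF("entity_id", "json_content", "other_useful_id")

On object stores such as gs://, the per-partition Configuration would also need the relevant connector settings.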

Re: [spark-core] docker-image-tool.sh question...

2021-03-10 Thread Muthu Jayakumar
(quoted docker build output: several layers "Pull complete", Digest: sha256:412c52d88d77ea078c50ed4cf8d8656d6448b1c92829128e1c6aab6687ce0998, Status: Downloaded newer image for openjdk:8-jre-slim ---> 8f867fdbd02f) > What do you see at your side? > Best regar

[spark-core] docker-image-tool.sh question...

2021-03-09 Thread Muthu Jayakumar
Hello there, While using docker-image-tool (for Spark 3.1.1) it seems not to accept the `java_image_tag` property; the docker image defaults to JRE 11. Here is what I am running from the command line: $ spark/bin/docker-image-tool.sh -r docker.io/sample-spark -b java_image_tag=8-jre-slim -t 3.1.1

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Muthu Jayakumar

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Muthu Jayakumar
I suspect the Spark job somehow has an incorrect (newer) version of json4s on the classpath; json4s 3.5.3 is the latest version that can be used. Thanks, Muthu On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh wrote: > Hi, > > Spark version 2.4.3 > Hbase 1.2.7 > > Data is stored in Hbase as
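A hedged sbt sketch of one way to keep json4s pinned so a newer transitive copy does not leak onto the Spark classpath (the artifact list is an assumption; adjust to the modules your build actually pulls in):

    // build.sbt
    dependencyOverrides += "org.json4s" %% "json4s-core"    % "3.5.3"
    dependencyOverrides += "org.json4s" %% "json4s-jackson" % "3.5.3"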

Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
If you require higher precision, you may have to write a custom UDAF. In my case, I ended up storing the data as a key-value ordered list of histograms. Thanks Muthu On Mon, Nov 11, 2019, 20:46 Patrick McCarthy wrote: > Depending on your tolerance for error you could also use >
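For cases where the built-in approximate percentile is precise enough, a small sketch (table and column names are invented):

    import org.apache.spark.sql.functions.expr

    val stats = spark.table("measurements")
      .groupBy("sensor_id")
      .agg(
        expr("percentile_approx(value, 0.5)").as("approx_median"),
        expr("percentile_approx(value, array(0.25, 0.75))").as("quartiles"))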

Re: Core allocation is scattered

2019-07-31 Thread Muthu Jayakumar
> I am running a spark job with 20 cores, but I did not understand why my application gets 1-2 cores on a couple of machines. Why does it not just run on two nodes, e.g. node1 = 16 cores and node2 = 4 cores? Instead the cores are allocated like node1 = 2, node2 = 1, ... node14 = 1. I believe that's the intended
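For the standalone cluster manager, this spreading behaviour is governed by spark.deploy.spreadOut (a master-side setting, default true). A hedged sketch of the per-application knobs that interact with it:

    import org.apache.spark.sql.SparkSession

    // With spark.deploy.spreadOut=false in the master's spark-defaults.conf, the
    // master packs cores onto as few workers as possible instead of spreading them.
    val spark = SparkSession.builder
      .appName("core-allocation-demo")
      .config("spark.cores.max", "20")       // total cores requested by this application
      .config("spark.executor.cores", "4")   // cores per executor
      .getOrCreate()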

Number of tasks...

2019-07-29 Thread Muthu Jayakumar
Hello there, I have a basic question about how the number of tasks is determined per spark job; let's scope the discussion to parquet and Spark 2.x. 1. I thought that the number of tasks is proportional to the number of part files that exist. Is this correct? 2. I noticed that for
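A quick way to observe (and influence) how a Spark 2.x parquet scan is split into tasks, assuming a live SparkSession `spark`; the path is a placeholder and the values are only illustrative:

    spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)  // smaller target split -> more tasks
    spark.conf.set("spark.sql.files.openCostInBytes", (4 * 1024 * 1024).toString)     // cost added per file when packing small files

    val df = spark.read.parquet("hdfs:///data/my_table")
    println(s"input partitions (tasks of the scan stage): ${df.rdd.getNumPartitions}")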

Re: Re: Can an UDF return a custom class other than case class?

2019-01-07 Thread Muthu Jayakumar
Perhaps use of a generic StructType may work in your situation of being language agnostic? Case classes are backed by implicits that provide the type conversions into the columnar format. My 2 cents. Thanks, Muthu On Mon, Jan 7, 2019 at 4:13 AM yeikel valdes wrote: > > > Forwarded Message
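A hedged sketch of the generic-StructType idea: a UDF that returns a plain Row with an explicitly supplied schema, so no case class is needed (newer Spark 3.x releases may gate this untyped form behind a legacy flag):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types._

    val resultType = StructType(Seq(
      StructField("label", StringType, nullable = true),
      StructField("score", DoubleType, nullable = false)))

    // The return schema is declared explicitly, not inferred from a case class.
    val scoreUdf = udf((s: String) => Row(s.toUpperCase, s.length.toDouble), resultType)

    // df.withColumn("scored", scoreUdf(col("some_text_column")))   // hypothetical usage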

Re: Spark job on dataproc failing with Exception in thread "main" java.lang.NoSuchMethodError: com.googl

2018-12-20 Thread Muthu Jayakumar
The error indicates that Preconditions.checkArgument() is being called with a parameter signature that does not match the Guava version on the classpath. Could you check how many jars (before the uber jar) actually contain this method signature? I smell a jar version conflict or something similar. Thanks Muthu On Thu, Dec 20, 2018, 02:40 Mich
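A hedged way to check, at runtime, which jar is actually supplying Guava's Preconditions and whether more than one copy is visible:

    import scala.collection.JavaConverters._

    // Where did the loaded class come from?
    println(classOf[com.google.common.base.Preconditions]
      .getProtectionDomain.getCodeSource.getLocation)

    // Are there multiple copies on the classpath?
    getClass.getClassLoader
      .getResources("com/google/common/base/Preconditions.class")
      .asScala.foreach(println)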

Re: error in job

2018-10-06 Thread Muthu Jayakumar
The error means that you are missing commons-configuration-<version>.jar from the classpath of the driver/worker. Thanks, Muthu On Sat, Sep 29, 2018 at 11:55 PM yuvraj singh <19yuvrajsing...@gmail.com> wrote: > Hi , i am getting this error please help me . > > > 18/09/30 05:14:44 INFO Client: >

Re: Encoder for JValue

2018-09-19 Thread Muthu Jayakumar
A naive workaround may be to transform the json4s JValue into a String (using something like compact()) and process it as a String. Once you are done with the last action, you can convert it back to a JValue (using something like parse()). Thanks, Muthu On Wed, Sep 19, 2018 at 6:35 AM Arko Provo
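A minimal sketch of that workaround, assuming json4s is on the classpath: keep the payload as a String inside the Dataset (String has an encoder) and parse back to JValue only at the edges.

    import org.json4s._
    import org.json4s.jackson.JsonMethods.{compact, render, parse}

    def jvalueToString(jv: JValue): String = compact(render(jv))
    def stringToJValue(s: String): JValue  = parse(s)

    // val ds: Dataset[String] = inputJValues.map(jvalueToString).toDS()   // hypothetical input
    // ... transformations on Dataset[String] ...
    // val back = ds.collect().map(stringToJValue)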

Re: Parquet

2018-07-20 Thread Muthu Jayakumar
I generally write to Parquet when I want to repeat the operation of reading the data and performing different operations on it each time. This saves database time for me. Thanks Muthu On Thu, Jul 19, 2018, 18:34 amin mohebbi wrote: > We have two big tables, each with 5 billion rows, so my

Spark + CDB (Cockroach DB) support...

2018-06-15 Thread Muthu Jayakumar
Hello there, I am trying to check whether CDB support is available for Apache Spark. I can currently use CDB through the Postgres driver, but I would like to check whether there are any specialized drivers that optimize for predicate push-down and other optimizations pertaining to
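A hedged sketch of the plain-JDBC route mentioned above (host, database, and table names are made up); the JDBC source pushes simple filters down to the database where it can:

    import org.apache.spark.sql.functions.col

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://cockroach-host:26257/mydb")
      .option("driver", "org.postgresql.Driver")
      .option("dbtable", "public.orders")
      .option("user", "spark_reader")
      .option("password", "secret")
      .load()

    val recent = orders.where(col("created_at") >= "2018-01-01")   // candidate for pushdown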

Re: Does Spark run on Java 10?

2018-04-01 Thread Muthu Jayakumar
"... compliant it could work." > From the links you pointed out, it looks like Scala 2.11.12 is compliant with Java 9/10? > Thanks! > On Sun, Apr 1, 2018 at 7:50 AM, Muthu Jayakumar <bablo...@gmail.com> wrote: >> Short answer may be

Re: Does Spark run on Java 10?

2018-04-01 Thread Muthu Jayakumar
Short answer may be no. Spark runs on Scala 2.11, and even Scala 2.12 is not yet fully Java 9 compliant. For more info, see http://docs.scala-lang.org/overviews/jdk-compatibility/overview.html (check the last section) and https://issues.apache.org/jira/browse/SPARK-14220. On a side note, if some coming

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread Muthu Jayakumar
The problem with 'spark.sql.shuffle.partitions' is that it needs to be set before the spark session is created (I guess?). But ideally, I want to partition by a column during a join / group-by (something roughly like repartitionBy(partitionExpression: Column*) from
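A small sketch of the two levers being compared, assuming an existing SparkSession `spark` and DataFrame `df` (in recent Spark versions the shuffle setting can also be changed on a live session via spark.conf.set):

    import org.apache.spark.sql.functions.{col, sum}

    spark.conf.set("spark.sql.shuffle.partitions", "400")     // session-wide default for shuffles

    val totals = df
      .repartition(400, col("customer_id"))                   // per-operation partitioning by column
      .groupBy(col("customer_id"))
      .agg(sum("amount").as("total"))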

DataFrame --- join / groupBy-agg question...

2017-07-11 Thread Muthu Jayakumar
Hello there, I may have a naive question on join / groupBy-agg. During the days of RDD, whenever I wanted to perform a. a groupBy-agg, I used to call reduceByKey (of PairRDDFunctions) with an optional partition strategy (which is the number of partitions or a Partitioner); b. a join (of PairRDDFunctions)

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
<kanth...@gmail.com> wrote: > Are you launching SparkSession from a MicroService or through spark-submit? > On Sun, Jun 4, 2017 at 11:52 PM, Muthu Jayakumar <bablo...@gmail.com> wrote: >> Hello Kant, >> > I still don't unde

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
Hello Kant, > I still don't understand how SparkSession can use Akka to communicate with the Spark cluster? Let me use your initial requirement as a way to illustrate what I mean, i.e., "I want my Micro service app to be able to query and access data on HDFS". In order to run a query, say a DF query

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Muthu Jayakumar
One drastic suggestion can be to write a simple microservice using Akka, create a SparkSession (during the start of the VM) and pass it around. You can look at SparkPi for sample source code to start writing your microservice. In my case, I used akka-http to wrap my business requests and transform

Spark repartition question...

2017-04-30 Thread Muthu Jayakumar
Hello there, I am trying to understand the difference between the following repartition() overloads... a. def repartition(partitionExprs: Column*): Dataset[T] b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] c. def repartition(numPartitions: Int): Dataset[T] My understanding is
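A small sketch of the three overloads side by side, assuming some Dataset `ds` with a user_id column:

    import org.apache.spark.sql.functions.col

    val byColumn      = ds.repartition(col("user_id"))       // hash by column; count comes from spark.sql.shuffle.partitions
    val byCountAndCol = ds.repartition(64, col("user_id"))   // hash by column into exactly 64 partitions
    val byCountOnly   = ds.repartition(64)                   // shuffle into 64 partitions with no partitioning expression

    println(byCountAndCol.rdd.getNumPartitions)               // 64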

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
> ... store, since you are doing batch-like processing (reading from Parquet files) and it is possible to control this part fully. And it also seems like you want to use ES. You can try to reduce the number of Spark executors to throttle the writes to ES. > -Shiva >

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
> ... views on top of it to support your queries. Since you mention that your queries are going to be simple, you can define your indexes in the materialized views according to how you want to query the data. > Thanks, > Shiva > On Wed, Mar 15, 2017 at 7:58 PM, Mut

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
Hello Vincent, Cassandra may not fit the bill if I need to define my partitions and other indexes upfront. Is this right? Hello Richard, Let me evaluate Apache Ignite. I evaluated it 3 months back, and back then its connector to Apache Spark did not support Spark 2.0. Another drastic thought

Re: Pretty print a dataframe...

2017-02-16 Thread Muthu Jayakumar
This worked. Thanks for the tip, Michael. Thanks, Muthu On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust <mich...@databricks.com> wrote: > The toString method of Dataset.queryExecution includes the various plans. I usually just log that directly. > On Thu, Feb 16, 2017

Pretty print a dataframe...

2017-02-16 Thread Muthu Jayakumar
Hello there, I am trying to log a dataframe/dataset's queryExecution and/or its logical plan. The current code... def explain(extended: Boolean): Unit = { val explain = ExplainCommand(queryExecution.logical, extended = extended)
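A hedged sketch of the approach suggested later in the thread: log the queryExecution's string form instead of calling explain(), which only prints to stdout (`log` stands in for whatever logger is in use):

    val df = spark.range(10).toDF("n")

    log.info("all plans:\n" + df.queryExecution.toString)
    log.info("logical plan only:\n" + df.queryExecution.logical.toString)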

Re: Dataframe caching

2017-01-20 Thread Muthu Jayakumar
I guess, this may help in your case? https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view Thanks, Muthu On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Dear all, > > Here is a requirement I am thinking of
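A minimal sketch of that suggestion: a global temporary view outlives the creating session and is addressed through the reserved global_temp database (df and the view name are placeholders):

    df.createGlobalTempView("events")

    val countFromAnotherSession = spark.newSession()
      .sql("SELECT count(*) FROM global_temp.events")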

Re: Dependency Injection and Microservice development with Spark

2016-12-30 Thread Muthu Jayakumar
Adding to Lars Albertsson & Miguel Morales: I am hoping to see how well scalameta's macro support evolves to do away with sizable DI problems, and for the remainder, having a class type as args as Miguel Morales mentioned. Thanks, On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales

Re: DataFrame select non-existing column

2016-11-18 Thread Muthu Jayakumar
Depending on your use case, 'df.withColumn("my_existing_or_new_col", lit(0L))' could work? On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren wrote: > Thanks for your answer. I have been searching the API for doing that > but I could not find how to do it? > > Could you give
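A hedged sketch of wrapping that idea so the column is only added when missing (the helper name is invented):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.lit

    def withDefaultColumn(df: DataFrame, name: String): DataFrame =
      if (df.columns.contains(name)) df else df.withColumn(name, lit(0L))

    // withDefaultColumn(someDf, "my_existing_or_new_col").select("my_existing_or_new_col")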

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
> ... the analysis phase. > Cheng > On 10/21/16 12:50 PM, Muthu Jayakumar wrote: >> Sorry for the late response. Here is what I am seeing... Schema from the parquet file: d1.printSchema() root |-- task_id: string (nullable = true)

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
Michael Armbrust <mich...@databricks.com> wrote: > What is the issue you see when unioning? > On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar <bablo...@gmail.com> wrote: >> Hello Michael, >> Thank you for looking into this query. In my case there seems to be an >> is

Re: Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
... Are you trying to change the nullability of the column? > On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com> wrote: >> Hello there, >> I am trying to understand how and when a DataFrame (or Dataset) sets >> nullable = true vs false

Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
Hello there, I am trying to understand how and when a DataFrame (or Dataset) sets nullable = true vs false on a schema. Here is my observation from a sample code I tried... scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2",
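A small sketch of the behaviour being asked about: primitive-typed case-class fields come out as nullable = false, while reference and Option-wrapped fields come out as nullable = true:

    import spark.implicits._

    case class Rec(id: Int, name: String, score: Option[Double])

    Seq(Rec(1, "a", Some(2.0))).toDS().printSchema()
    // root
    //  |-- id: integer (nullable = false)    <- primitive Int
    //  |-- name: string (nullable = true)    <- reference type
    //  |-- score: double (nullable = true)   <- Option[Double]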

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Muthu Jayakumar
Hello Hao Ren, Doesn't the code... val add = udf { (a: Int) => a + notSer.value } ...mean a UDF function of Int => Int? Thanks, Muthu On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren wrote: > I am playing with spark 2.0 > What I tried to test is: > > Create a UDF in which

Re: Dataframe / Dataset partition size...

2016-08-06 Thread Muthu Jayakumar

Dataframe / Dataset partition size...

2016-08-06 Thread Muthu Jayakumar
Hello there, I am trying to understand how I could improve (or increase) the parallelism of tasks that run for a particular spark job. Here is my observation... scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size() 25 > hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l

Re: Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Muthu Jayakumar
... but I don't know how to split and map the row elegantly, hence using it as an RDD. Thanks, Muthu On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng <mengdong0...@gmail.com> wrote: > you can specify nullable in StructField > > On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@

Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Muthu Jayakumar
Hello there, I am using Spark 2.0.0 to create a parquet file from a text file with Scala. I am trying to read a text file with a bunch of values of type string and long (mostly), and any of the values can be null. In order to support nulls, all the values are boxed with Option (ex:-
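A hedged sketch of that setup (field names, the delimiter, and paths are invented): empty tokens become None, and the Option fields turn into nullable parquet columns:

    import spark.implicits._

    case class Record(name: Option[String], count: Option[Long])

    val parsed = spark.read.textFile("hdfs:///input/records.txt").map { line =>
      val parts = line.split('\t')
      def opt(i: Int): Option[String] =
        if (i < parts.length && parts(i).nonEmpty) Some(parts(i)) else None
      Record(opt(0), opt(1).map(_.toLong))
    }

    parsed.toDF().write.parquet("hdfs:///output/records_parquet")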

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
Does increasing the number of partitions help? You could try something like 3 times what you currently have. Another trick I used was to partition the problem into multiple dataframes, run them sequentially, persist the results, and then run a union on the results. Hope this helps. On Fri,
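A rough sketch of the split / persist / union trick (the DataFrame bigDf, the split key, and the values are placeholders; on Spark 1.x the final combine is unionAll rather than union):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val slices = Seq("2016-01", "2016-02", "2016-03").map { month =>
      val part = bigDf.where(col("month") === month).persist(StorageLevel.MEMORY_AND_DISK)
      part.count()                       // force materialization of this slice before moving on
      part
    }

    val combined = slices.reduce(_ unionAll _)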

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar

Re: cast column string -> timestamp in Parquet file

2016-01-21 Thread Muthu Jayakumar
DataFrame and udf. This may be more performant than doing an RDD transformation, as you'll only transform the column that needs to be changed. Hope this helps. On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote: > Hi > > I have a large parquet file. > > I need
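A hedged sketch of the column-level version (the column name and timestamp pattern are assumptions), which avoids touching the rest of the row:

    import org.apache.spark.sql.functions.{col, unix_timestamp}

    val withTs = df.withColumn(
      "event_ts",
      unix_timestamp(col("event_time_str"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))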

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
Thanks Michael. Let me test it with a recent master code branch. Also, for every mapping step do I have to create a new case class? I cannot use a Tuple as I have ~130 columns to process. Earlier I had used a Seq[Any] (actually Array[Any], to optimize serialization) but processed it using RDD

Re: Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Muthu Jayakumar
>export SPARK_WORKER_MEMORY=4g Maybe you could increase the max heap size on the worker? In case the OutOfMemory is on the driver, you may want to set the heap size explicitly for the driver as well. Thanks, On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote: > Hello, > >

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
(quoted fragment of the Spark-generated Java code from the error output: the null branch calls mutableRow.setNullAt(0), the other branch calls mutableRow.update(0, primitive1), then mutableRow is returned) Thanks. On Tue, Jan 12, 2016 at 11:35 AM, Muthu Jayakumar <bablo...@gmail.com> wrote: > Tha

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Muthu Jayakumar
Hello Michael, Thank you for the suggestion. This should do the trick for the column names. But how could I transform a column's value type? Do I have to use a UDF? If I use a UDF, then the other question I have pertains to the map step on a dataset, where I am running into an error when I

Spark 1.6 udf/udaf alternatives in dataset?

2016-01-10 Thread Muthu Jayakumar
Hello there, While looking at the features of Dataset, it seems to provide an alternative to udf and udaf. Any documentation or sample code snippet on how to write this would be helpful for rewriting existing UDFs as Dataset mapping steps. Also, while extracting a value into a Dataset using as[U]
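A small sketch of the Dataset-mapping alternative to a UDF (names and paths are invented): the per-row logic becomes an ordinary function over case classes, with no UDF registration:

    import org.apache.spark.sql.Dataset
    import spark.implicits._

    case class In(id: Long, raw: String)
    case class Out(id: Long, cleaned: String, len: Int)

    val in: Dataset[In]   = spark.read.parquet("hdfs:///input").as[In]
    val out: Dataset[Out] = in.map(r => Out(r.id, r.raw.trim.toLowerCase, r.raw.length))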

Re: Out of memory issue

2016-01-06 Thread Muthu Jayakumar
Thanks Ewan Leith. This seems like a good start, as it seems to match the symptoms I am seeing :). But how do I specify "parquet.memory.pool.ratio"? The Parquet code seems to take this parameter from ParquetOutputFormat.getRecordWriter() (ref code: float
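One hedged guess at the plumbing, since ParquetOutputFormat reads its settings from the Hadoop Configuration: set the key on the SparkContext's hadoopConfiguration, or pass it with the spark.hadoop. prefix at submit time. This is an assumption about how the value reaches the writer, not a tested recipe:

    // assuming an existing SparkContext `sc`
    sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")

    // equivalent at submit time:
    //   --conf spark.hadoop.parquet.memory.pool.ratio=0.3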

Re: Spark and Spring Integrations

2015-11-15 Thread Muthu Jayakumar
> public void call(String t) throws Exception { springBean.someAPI(t); // here we will have a db transaction as well } }); } > Thanks, > Netai > On Sat, Nov 14, 2015 at 10:40 PM, Muthu Jayakumar <bablo...@gmail.com> wrote: >

Re: Spark and Spring Integrations

2015-11-14 Thread Muthu Jayakumar
You could try using an akka actor system with apache spark if you intend to use it in an online / interactive job execution scenario. On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote: > You are probably trying to access the spring context from the