Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
make a difference, I guess. In the question above I do have to process row-wise, and an RDD may be more efficient? Thanks, Muthu On Tue, 12 Jul 2022 at 14:55, ayan guha wrote: > Another option is: > > 1. collect the dataframe with file path > 2. create a list of paths > 3. crea
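
A minimal sketch of the approach described above (assuming a SparkSession named spark, a dataframe df with a file_path column, and JSON files readable from the configured filesystem; all names are illustrative):

import spark.implicits._

// Collect the file paths from the dataframe, then read all the JSON files in one pass.
val paths: Seq[String] = df.select("file_path").as[String].collect().toSeq

// spark.read.json accepts multiple paths; each file contributes rows to the resulting dataframe.
val jsonDf = spark.read.json(paths: _*)

If the originating path is needed per row, org.apache.spark.sql.functions.input_file_name() can be added as a column after the read.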

Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
sparkContext. I did try to use the GCS Java API to read content, but ran into many JAR conflicts as the HDFS wrapper and the JAR library use different dependencies. Hope these findings help others as well. Thanks, Muthu On Mon, 11 Jul 2022 at 14:11, Enrico Minack wrote: > All you need to

reading each JSON file from dataframe...

2022-07-10 Thread Muthu Jayakumar
`, `other_useful_id`, `json_content`, `file_path`. Assume that I already have the required HDFS URL libraries in my classpath. Please advise, Muthu

Re: [spark-core] docker-image-tool.sh question...

2021-03-10 Thread Muthu Jayakumar
ago /bin/sh -c #(nop) ADD file:3a7bff4e139bcacc5… 69.2MB (2) $ docker run --entrypoint "/usr/local/openjdk-8/bin/java" 3ef86250a35b '-version' openjdk version "1.8.0_275" OpenJDK Runtime Environment (build 1.8.0_275-b01) OpenJDK 64-Bit Server VM (build 25.275-b01, mix

[spark-core] docker-image-tool.sh question...

2021-03-09 Thread Muthu Jayakumar
build. Please advise, Muthu

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Muthu Jayakumar
older version, make sure all of them are older than 3.2.11 at the least. Hope it helps. Thanks, Muthu On Mon, Feb 17, 2020 at 1:15 PM Mich Talebzadeh wrote: > Thanks Muthu, > > > I am using the following jar files for now in local mode i.e. > spark-shell_local > --j

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Muthu Jayakumar
I suspect the spark job is somehow picking up an incorrect (newer) version of json4s on the classpath. json4s 3.5.3 is the highest version that can be used. Thanks, Muthu On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh wrote: > Hi, > > Spark version 2.4.3 > Hbase 1.2.7 > > Data
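
For sbt users, one way to avoid mixed json4s versions is to pin the whole json4s family; a hypothetical build.sbt fragment (assuming sbt 1.x and the 3.5.x line that this Spark version expects):

// build.sbt -- force a single json4s version across the dependency tree
dependencyOverrides ++= Seq(
  "org.json4s" %% "json4s-core"    % "3.5.3",
  "org.json4s" %% "json4s-jackson" % "3.5.3",
  "org.json4s" %% "json4s-ast"     % "3.5.3"
)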

Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
If you require higher precision, you may have to write a custom UDAF. In my case, I ended up storing the data as a key-value ordered list of histograms. Thanks Muthu On Mon, Nov 11, 2019, 20:46 Patrick McCarthy wrote: > Depending on your tolerance for error you could also
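
For reference, the approximate built-in discussed in this thread can be used directly from SQL expressions; a small sketch (assuming a dataframe df with a numeric column named value):

// percentile_approx(column, percentile, accuracy): higher accuracy => more memory, less error.
val p95 = df.selectExpr("percentile_approx(value, 0.95, 10000) AS p95")

// The exact 'percentile' aggregate is also available, at a higher memory cost.
val median = df.selectExpr("percentile(value, 0.5) AS median")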

Re: Core allocation is scattered

2019-07-31 Thread Muthu Jayakumar
>I am running a spark job with 20 cores but I did not understand why my application gets 1-2 cores on a couple of machines. Why does it not just run on two nodes, like node1=16 cores and node2=4 cores? Instead cores are allocated like node1=2, node2=1, ..., node14=1. I believe that's the intended

Number of tasks...

2019-07-29 Thread Muthu Jayakumar
data size + shuffle operation. Please advise, Muthu

Re: Re: Can an UDF return a custom class other than case class?

2019-01-07 Thread Muthu Jayakumar
Perhaps a generic StructType may work in your situation, since it is language agnostic? Case classes are backed by implicits that provide the type conversion into columnar form. My 2 cents. Thanks, Muthu On Mon, Jan 7, 2019 at 4:13 AM yeikel valdes wrote: > > > Forwarded Message

Re: Spark job on dataproc failing with Exception in thread "main" java.lang.NoSuchMethodError: com.googl

2018-12-20 Thread Muthu Jayakumar
The error reads as if the Preconditions.checkArgument() method is being called with an incorrect parameter signature. Could you check to see how many jars (before the uber jar) actually contain this method signature? I smell a jar version conflict or something similar. Thanks Muthu On Thu, Dec 20, 2018, 02:40 Mich

Re: error in job

2018-10-06 Thread Muthu Jayakumar
The error means that you are missing commons-configuration-<version>.jar from the classpath of the driver/worker. Thanks, Muthu On Sat, Sep 29, 2018 at 11:55 PM yuvraj singh <19yuvrajsing...@gmail.com> wrote: > Hi , i am getting this error please help me . > > > 18/09/30 05

Re: Encoder for JValue

2018-09-19 Thread Muthu Jayakumar
A naive workaround may be to transform the json4s JValue to a String (using something like compact()) and process it as a String. Once you are done with the last action, you could convert it back to a JValue (using something like parse()). Thanks, Muthu On Wed, Sep 19, 2018 at 6:35 AM Arko Provo
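
A rough sketch of that workaround (assuming json4s on the classpath; the helper names are illustrative):

import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

// JValue has no Spark Encoder, so carry it through Spark as a plain String...
def toWire(jv: JValue): String = compact(render(jv))

// ...and re-parse it only after the final action, outside Spark's serialization.
def fromWire(s: String): JValue = parse(s)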

Re: Parquet

2018-07-20 Thread Muthu Jayakumar
I generally write to Parquet when I want to repeatedly read the same data and perform different operations on it each time. This saves db time for me. Thanks Muthu On Thu, Jul 19, 2018, 18:34 amin mohebbi wrote: > We do have two big tables each includes 5 billion of rows, so
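
A small sketch of that pattern (the JDBC URL, table, column, and paths are hypothetical):

import java.util.Properties

// One-time extract from the database, persisted as Parquet...
val props = new Properties() // credentials would go here
val fromDb = spark.read.jdbc("jdbc:postgresql://db:5432/app", "big_table", props)
fromDb.write.mode("overwrite").parquet("/data/big_table_snapshot")

// ...then every later analysis reads the snapshot instead of hitting the database again.
val snapshot = spark.read.parquet("/data/big_table_snapshot")
snapshot.groupBy("customer_id").count().show()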

Spark + CDB (Cockroach DB) support...

2018-06-15 Thread Muthu Jayakumar
-locality. Thanks Muthu

Re: Does Spark run on Java 10?

2018-04-01 Thread Muthu Jayakumar
It is supported with some limitations on JSR 376 (JPMS) that can cause linker errors. Thanks, Muthu On Sun, Apr 1, 2018 at 11:15 AM, kant kodali <kanth...@gmail.com> wrote: > Hi Muthu, > > "On a side note, if some coming version of Scala 2.11 becomes full Java > 9/10

Re: Does Spark run on Java 10?

2018-04-01 Thread Muthu Jayakumar
version of Scala 2.11 becomes full Java 9/10 compliant it could work. Hope this helps. Thanks, Muthu On Sun, Apr 1, 2018 at 6:57 AM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > Does anybody got Spark running on Java 10? > > Thanks! > > >

Re: Spark submit OutOfMemory Error in local mode

2017-08-29 Thread muthu
Are you getting OutOfMemory on the driver or on the executor? A typical cause of OOM in Spark is too few tasks for a job.
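
If too few tasks turns out to be the cause, raising the parallelism is the usual first step; a sketch (the numbers are illustrative, not a recommendation):

// More partitions mean a smaller per-task working set, which often avoids executor OOM.
val repartitioned = df.repartition(400)

// For shuffles produced by joins and aggregations, the equivalent knob is:
spark.conf.set("spark.sql.shuffle.partitions", "400")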

Re: Kill Spark Application programmatically in Spark standalone cluster

2017-08-29 Thread muthu
I had a similar question in the past and worked around it by having my spark-submit application register with my master application in order to coordinate kill and/or progress of execution. This is a bit clunky, I suppose, in comparison to the REST-like API available in the Spark standalone cluster.

Spark standalone API...

2017-08-29 Thread muthu
available to the SparkListener interface that's available within every spark application. Please advise, Muthu

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread Muthu Jayakumar
/api/scala/index.html#org.apache.spark.sql.Dataset). This way I can adjust the numbers based on the data. Thanks, Muthu On Wed, Jul 19, 2017 at 8:23 AM, ayan guha <guha.a...@gmail.com> wrote: > You can use spark.sql.shuffle.partitions to adjust amount of parallelism. > > On Wed, Jul 19, 2

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread muthu
itions'. In ideal situations, we have a long-running application that uses the same spark-session and runs one or more queries using FAIR mode. Thanks, Muthu On Wed, Jul 19, 2017 at 6:03 AM, qihuagao [via Apache Spark User List] < ml+s1001560n28879...@n3.nabble.com> wrote: >

DataFrame --- join / groupBy-agg question...

2017-07-11 Thread muthu
I may have a naive question on join / groupBy-agg. During the days of RDD, whenever I wanted to perform a. groupBy-agg, I used to use reduceByKey (of PairRDDFunctions) with an optional partition strategy (which is the number of partitions or a Partitioner) b. join (of PairRDDFunctions) and its

DataFrame --- join / groupBy-agg question...

2017-07-11 Thread Muthu Jayakumar
during a join is to set 'spark.sql.shuffle.partitions' to some desired number during spark-submit. I am trying to see if there is a way to provide this programmatically for every step of a groupBy-agg / join. Please advise, Muthu
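
A sketch of doing this programmatically (assuming a SparkSession named spark; the dataframes and paths are hypothetical). Because execution is lazy, the value in effect when the action actually runs is the one that applies, so set it just before triggering each step:

// Coarse-grained aggregation: fewer, larger shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "64")
ordersDf.groupBy("customer_id").count().write.parquet("/tmp/order_counts")

// A heavier join later in the same session can use a different value.
spark.conf.set("spark.sql.shuffle.partitions", "512")
ordersDf.join(customersDf, Seq("customer_id")).write.parquet("/tmp/orders_enriched")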

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
I run a spark-submit (https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications) in client mode that starts the micro-service. If you keep the event loop going then the spark context would remain active. Thanks, Muthu On Mon, Jun 5, 2017 at 2:44 PM, kant kodali
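
A minimal sketch of that setup (assumptions: the application is launched with spark-submit in client mode, Spark is on the classpath, and the endpoint, port, and Parquet path are hypothetical). The serving thread keeps the JVM, and therefore the SparkContext, alive:

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import org.apache.spark.sql.SparkSession

object ParquetQueryService {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-query-service").getOrCreate()
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/count", new HttpHandler {
      override def handle(ex: HttpExchange): Unit = {
        // Each request runs a query against the long-lived SparkSession.
        val count = spark.read.parquet("/data/events.parquet").count()
        val body = s"""{"count": $count}""".getBytes("UTF-8")
        ex.sendResponseHeaders(200, body.length)
        ex.getResponseBody.write(body)
        ex.getResponseBody.close()
      }
    })
    server.start() // non-daemon server thread keeps the context active between requests
  }
}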

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
park in Spark Standalone with a 32-node cluster. Hope this gives a better idea. Thanks, Muthu On Sun, Jun 4, 2017 at 10:33 PM, kant kodali <kanth...@gmail.com> wrote: > Hi Muthu, > > I am actually using Play framework for my Micro service which uses Akka > but I still don't unde

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Muthu Jayakumar
them to read Parquet and respond with results. Hope this helps. Thanks Muthu On Mon, Jun 5, 2017, 01:01 Sandeep Nemuri <nhsande...@gmail.com> wrote: > Well if you are using Hortonworks distribution there is Livy2 which is > compatible with Spark2 and scala 2.11. > > > https:/

Spark repartition question...

2017-04-30 Thread Muthu Jayakumar
astore. Please advise, Muthu

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
alysis with full table scan scenarios. But I am thankful for the many ideas and perspectives on how this could be looked at. Thanks, Muthu On Wed, Mar 15, 2017 at 7:25 PM, Shiva Ramagopal <tr.s...@gmail.com> wrote: > Hi, > > The choice of ES vs Cassandra should really be made dependin

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
our thoughts. Thanks, Muthu On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote: > Hi muthu, > > I agree with Shiva, Cassandra also supports SASI indexes, which can > partially replace Elasticsearch functionality. > > Regards, > Uladzimir > > > >

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar
solution may be Spark to Kafka to ElasticSearch? More thoughts welcome please. Thanks, Muthu On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <rsiebel...@gmail.com> wrote: > maybe Apache Ignite does fit your requirements > > On 15 March 2017 at 08:44, vincent gromakowski <

Fast write datastore...

2017-03-15 Thread muthu
perform simple filters and sorts using ElasticSearch, and for more complex aggregates, Spark Dataframe can come back to the rescue :). Please advise on other possible data-stores I could use? Thanks, Muthu

Re: Pretty print a dataframe...

2017-02-16 Thread Muthu Jayakumar
This worked. Thanks for the tip Michael. Thanks, Muthu On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust <mich...@databricks.com> wrote: > The toString method of Dataset.queryExecution includes the various plans. > I usually just log that directly. > > On Thu, Feb 16, 2017

Pretty print a dataframe...

2017-02-16 Thread Muthu Jayakumar
).executedPlan.executeCollect().foreach { // scalastyle:off println r => println(r.getString(0)) // scalastyle:on println } } sessionState is not accessible if I were to write my own explain(log: LoggingAdapter). Please advise, Muthu
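
For reference, the suggestion from this thread boils down to something like the following sketch (log here stands in for whatever logging abstraction is in use):

// queryExecution.toString contains the parsed, analyzed, optimized and physical plans,
// so logging it captures the same information as df.explain(true) without printing to stdout.
def logPlans(df: org.apache.spark.sql.DataFrame, log: String => Unit): Unit =
  log(df.queryExecution.toString)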

Re: Dataframe caching

2017-01-20 Thread Muthu Jayakumar
I guess, this may help in your case? https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view Thanks, Muthu On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Dear all, > > Here is a requiremen
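
A short sketch of the global temporary view approach (assuming a SparkSession named spark; the view and column names are illustrative):

// Global temp views live in the reserved 'global_temp' database and are visible
// across all SparkSessions of the same application.
df.createGlobalTempView("cached_events")

// Any other session in the same application can now query the same data.
val other = spark.newSession()
val total = other.sql("SELECT count(*) AS n FROM global_temp.cached_events")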

Re: Dependency Injection and Microservice development with Spark

2016-12-30 Thread Muthu Jayakumar
Adding to Lars Albertsson & Miguel Morales, I am hoping to see how well scalameta would grow into support for macros that can do away with sizable DI problems, and for the remainder, having a class type as args as Miguel Morales mentioned. Thanks, On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales

StreamingContext.textFileStream(...)

2016-12-07 Thread muthu
have to only combine the previously combined result with the result from the current time tn. Please advise, Muthu

Re: DataFrame select non-existing column

2016-11-18 Thread Muthu Jayakumar
Depending on your use case, 'df.withColumn("my_existing_or_new_col", lit(0l))' could work? On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren wrote: > Thanks for your answer. I have been searching the API for doing that > but I could not find how to do it? > > Could you give
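
Spelled out a little more (a sketch; the column name is the one used above, and the default value is an assumption):

import org.apache.spark.sql.functions.lit

// If the column is absent, add it with a default; if it is present, keep the existing values.
val withCol =
  if (df.columns.contains("my_existing_or_new_col")) df
  else df.withColumn("my_existing_or_new_col", lit(0L))

withCol.select("my_existing_or_new_col").show()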

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0. Thanks, Muthu On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian <l...@databricks.com> wrote: > Yea, confirmed. While analyzing unions, we treat StructTypes with > different field nullabilities as incompatible type

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
.scala:161) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59) at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594) at org.apache.spark.sql.Dataset.union(Dataset.scala:1459) Please advise, Muthu On Thu, Oct 20, 2016 at 1:46 AM, Michael Armb

Re: Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
on the simple schema of "col1 thru col4" above. But the problem seems to exist only on that "some_histogram" column which contains the mixed containsNull = true/false. Let me know if this helps. Thanks, Muthu On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.

Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
ue) ||-- freq: array (nullable = true) |||-- element: long (containsNull = true) Is there a way to convert this attribute from true to false without running any mapping / udf on that column? Please advise, Muthu
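
One common way to flip the flag without touching the data is to rebuild the dataframe against an adjusted copy of its own schema; a sketch (this only handles top-level fields and array elements, and it is safe only if the data truly contains no nulls):

import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// Rewrite nullable/containsNull flags in the schema, then re-wrap the same rows.
def withoutNulls(schema: StructType): StructType =
  StructType(schema.map {
    case StructField(name, ArrayType(elem, _), _, meta) =>
      StructField(name, ArrayType(elem, containsNull = false), nullable = false, meta)
    case StructField(name, dt, _, meta) =>
      StructField(name, dt, nullable = false, meta)
  })

val strictDf = spark.createDataFrame(df.rdd, withoutNulls(df.schema))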

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Muthu Jayakumar
Hello Hao Ren, Doesn't the code... val add = udf { (a: Int) => a + notSer.value } ...mean a UDF function that is Int => Int? Thanks, Muthu On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren <inv...@gmail.com> wrote: > I am playing with spark 2.0 > What I tried to test is: > &g

Re: Dataframe / Dataset partition size...

2016-08-06 Thread Muthu Jayakumar
.read.parquet(parquetFile).toJavaRDD.partitions.size() res2: Int = 20 Should I suspect something with dynamic allocation, perhaps? Please advise, Muthu On Sat, Aug 6, 2016 at 3:23 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > 720 cores Wow. That is a hell of cores Muthu :) > > Ok le

Dataframe / Dataset partition size...

2016-08-06 Thread Muthu Jayakumar
On a side note, I do understand that 200 parquet part files for the above 2.2 GB seems overkill for a 128 MB block size. Ideally it should be 18 parts or so. Please advise, Muthu

Re: Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Muthu Jayakumar
, but I don't know how to split and map the row elegantly. Hence I am using it as an RDD. Thanks, Muthu On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng <mengdong0...@gmail.com> wrote: > you can specify nullable in StructField > > On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@

Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Muthu Jayakumar
t[0, org.apache.spark.sql.Row, true], top level row object) +- input[0, org.apache.spark.sql.Row, true] Let me know if you would like me to try to create a more simplified reproducer for this problem. Perhaps I should not be using Option[T] for nullable schema values? Please advise, Muthu

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
Does increasing the number of partitions help? You could try something like 3 times what you currently have. Another trick I used was to partition the problem into multiple dataframes, run them sequentially, persist the results, and then run a union on the results. Hope this helps. On Fri,

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
nt from my Verizon Wireless 4G LTE smartphone > > > ---- Original message > From: Muthu Jayakumar <bablo...@gmail.com> > Date: 01/22/2016 3:50 PM (GMT-05:00) > To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" < > sande...@rose-hulman.edu

Re: cast column string -> timestamp in Parquet file

2016-01-21 Thread Muthu Jayakumar
DataFrame and udf. This may be more performant than doing an RDD transformation, as you'll only transform the column that needs to be changed. Hope this helps. On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote: > Hi > > I have a large size parquet file . > > I need
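
A sketch of the column-only transformation (the column name, timestamp format, and output path are assumptions):

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Replace only the affected column; every other column passes through untouched.
val withTs = df.withColumn(
  "event_time",
  unix_timestamp(col("event_time"), "yyyy-MM-dd HH:mm:ss").cast("timestamp")
)
withTs.write.parquet("/data/events_with_timestamp")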

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
Thanks Michael. Let me test it with a recent master code branch. Also, for every mapping step do I have to create a new case class? I cannot use a Tuple as I have ~130 columns to process. Earlier I had used a Seq[Any] (actually Array[Any], to optimize serialization) but processed it using RDD

Re: Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Muthu Jayakumar
>export SPARK_WORKER_MEMORY=4g Maybe you could increase the max heap size on the worker? If the OutOfMemory is on the driver, then you may want to set it explicitly for the driver. Thanks, On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote: > Hello, > >

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
mutableRow.setNullAt(0); /* 144 */ } else { /* 145 */ /* 146 */ mutableRow.update(0, primitive1); /* 147 */ } /* 148 */ /* 149 */ return mutableRow; /* 150 */ } /* 151 */ } /* 152 */ Thanks. On Tue, Jan 12, 2016 at 11:35 AM, Muthu Jayakumar <bablo...@gmail.com> wrote: > Tha

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Muthu Jayakumar
this may make it strongly typed? Thank you for looking into my email. Thanks, Muthu On Mon, Jan 11, 2016 at 3:08 PM, Michael Armbrust <mich...@databricks.com> wrote: > Also, while extracting a value into Dataset using as[U] method, how could >> I specify a custom encoder/translation

Spark 1.6 udf/udaf alternatives in dataset?

2016-01-10 Thread Muthu Jayakumar
] method, how could I specify a custom encoder/translation to a case class (where I don't have the same column-name mapping or same data-type mapping)? Please advise, Muthu

Re: Out of memory issue

2016-01-06 Thread Muthu Jayakumar
f.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO, MemoryManager.DEFAULT_MEMORY_POOL_RATIO);). I wonder how this is provided through Apache Spark. Meaning, I see that 'TaskAttemptContext' seems to be the hint to provide this. But I am not able to find a way to provide this configuration. Please advise, Muthu On

Re: Spark and Spring Integrations

2015-11-15 Thread Muthu Jayakumar
t can use the default serializer provided by Spark. Hope this helps. Thanks, Muthu On Sat, Nov 14, 2015 at 10:18 PM, Netai Biswas <mail2efo...@gmail.com> wrote: > Hi, > > Thanks for your response. I will give a try with akka also, if you have > any sample code or useful link please

Re: Spark and Spring Integrations

2015-11-14 Thread Muthu Jayakumar
You could try to use the akka actor system with apache spark, if you are intending to use it in an online / interactive job execution scenario. On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > You are probably trying to access the spring context from the

Transforming Flume events using Spark transformation functions

2014-07-22 Thread Sundaram, Muthu X.
= + logRecord); } Where do I create the new JavaRDD<String>? Do I do it before this loop? How do I create this JavaRDD<String>? In the loop I am able to get every record and I am able to print them. I appreciate any help here. Thanks, Muthu

RE: Transforming Flume events using Spark transformation functions

2014-07-22 Thread Sundaram, Muthu X.
; } }); return null; } -Original Message- From: Sundaram, Muthu X. [mailto:muthu.x.sundaram@sabre.com] Sent: Tuesday, July 22, 2014 10:24 AM To: user@spark.apache.org; d...@spark.incubator.apache.org Subject: Transforming Flume events using

RE: writing Flume data to HDFS

2014-07-14 Thread Sundaram, Muthu X.
, July 11, 2014 1:43 PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Re: writing Flume data to HDFS What is the error you are getting when you say "I was trying to write the data to hdfs..but it fails"… TD On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. muthu.x.sundaram

writing Flume data to HDFS

2014-07-10 Thread Sundaram, Muthu X.
I am new to spark. I am trying to do the following: Netcat -> Flume -> Spark Streaming (process Flume data) -> HDFS. My flume config file has the following setup: Source = netcat, Sink = avrosink. Spark Streaming code: I am able to print data from flume to the monitor. But I am struggling to create a file.