ake a difference regarding
I guess, for the question above, I do have to process row-wise, so an RDD
may be more efficient?
Thanks,
Muthu
On Tue, 12 Jul 2022 at 14:55, ayan guha wrote:
> Another option is:
>
> 1. collect the dataframe with file path
> 2. create a list of paths
> 3. crea
sparkContext. I did try to use the GCS Java API to read content, but ran into
many JAR conflicts, as the HDFS wrapper and the JAR library use different
dependencies.
Hope these findings help others as well.
Thanks,
Muthu
On Mon, 11 Jul 2022 at 14:11, Enrico Minack wrote:
> All you need to
`, `other_useful_id`,
`json_content`, `file_path`.
Assume that I already have the required HDFS url libraries in my classpath.
Please advise,
Muthu
ago /bin/sh -c #(nop) ADD
file:3a7bff4e139bcacc5… 69.2MB
(2) $ docker run --entrypoint "/usr/local/openjdk-8/bin/java" 3ef86250a35b
'-version'
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mix
build
Please advise,
Muthu
older version, make sure all of them are older
than 3.2.11 at the least.
Hope it helps.
Thanks,
Muthu
On Mon, Feb 17, 2020 at 1:15 PM Mich Talebzadeh
wrote:
> Thanks Muthu,
>
>
> I am using the following jar files for now in local mode i.e.
> spark-shell_local
> --j
I suspect the Spark job somehow has an incorrect (newer) version of
json4s in the classpath. json4s 3.5.3 is the latest version that can be
used.
Thanks,
Muthu
On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh
wrote:
> Hi,
>
> Spark version 2.4.3
> Hbase 1.2.7
>
> Data
If you require higher precision, you may have to write a custom UDAF.
In my case, I ended up storing the data as a key-value ordered list of
histograms.
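As a rough sketch of that shape in plain Scala (Histogram and
mergeHistograms are names I made up for illustration, not code from the
actual job):

```scala
// Sketch only: a "histogram" here is a Map of bin -> count, and merging
// two of them sums counts for matching bins. Returning the bins in key
// order gives the "key-value ordered list" mentioned above.
object HistogramMerge {
  type Histogram = Map[Double, Long]

  def mergeHistograms(a: Histogram, b: Histogram): Seq[(Double, Long)] =
    (a.keySet ++ b.keySet).toSeq.sorted.map { bin =>
      bin -> (a.getOrElse(bin, 0L) + b.getOrElse(bin, 0L))
    }
}
```

The merge is associative, which is what lets it be used as the reduce step
of an aggregation.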
Thanks
Muthu
On Mon, Nov 11, 2019, 20:46 Patrick McCarthy
wrote:
> Depending on your tolerance for error you could also
> I am running a Spark job with 20 cores, but I did not understand why my
> application gets 1-2 cores on a couple of machines. Why does it not just run
> on two nodes, like node1=16 cores and node2=4 cores? Instead, cores are
> allocated like node1=2, node2=1 ... node14=1.
I believe that's the intended
data size + shuffle operation.
Please advise
Muthu
Perhaps use of generic StructType may work in your situation of being
language agnostic? case-classes are backed by implicits to provide type
conversions into columnar.
My 2 cents.
Thanks,
Muthu
On Mon, Jan 7, 2019 at 4:13 AM yeikel valdes wrote:
>
>
> Forwarded Message
The error reads as though the Precondition.checkArgument() method is being
called with an incorrect parameter signature.
Could you check how many jars (before the uber jar was built) actually
contain this method signature?
I suspect a jar version conflict or something similar.
Thanks
Muthu
On Thu, Dec 20, 2018, 02:40 Mich
The error means that you are missing commons-configuration-version.jar
from the classpath of the driver/worker.
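For example, a sketch of shipping the jar with the application (the version
number, paths, and class name below are placeholders, not known from this
thread):

```shell
# Placeholder version/path; substitute your actual commons-configuration jar.
spark-submit \
  --jars /path/to/commons-configuration-1.10.jar \
  --class com.example.MyApp \
  my-app.jar
```

--jars distributes the listed jars to the executors and, in most deploy
modes, adds them to the driver classpath as well.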
Thanks,
Muthu
On Sat, Sep 29, 2018 at 11:55 PM yuvraj singh <19yuvrajsing...@gmail.com>
wrote:
> Hi , i am getting this error please help me .
>
>
> 18/09/30 05
A naive workaround may be to transform the json4s JValue to a String (using
something like compact()) and process it as a String. Once you are done with
the last action, you could convert it back to a JValue (using something like
parse()).
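The shape of that workaround, sketched without json4s on the classpath
(Doc stands in for JValue, and toWire/fromWire are stand-ins for json4s's
compact()/parse(), so this is illustration only):

```scala
// Sketch only: keep the value as a plain String while it crosses the
// serialization boundary, and convert to the rich type at the edges.
object StringBoundary {
  final case class Doc(fields: Map[String, String]) // stand-in for JValue

  def toWire(d: Doc): String =                      // like compact(render(jv))
    d.fields.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString(";")

  def fromWire(s: String): Doc =                    // like parse(str)
    Doc(s.split(";").toSeq.map { kv =>
      val parts = kv.split("=", 2)
      parts(0) -> parts(1)
    }.toMap)
}
```

A String is always serializable, which is the whole point of converting
before the shuffle and converting back after the last action.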
Thanks,
Muthu
On Wed, Sep 19, 2018 at 6:35 AM Arko Provo
I generally write to Parquet when I want to repeat the operation of reading
data and perform different operations on it every time. This would save db
time for me.
Thanks
Muthu
On Thu, Jul 19, 2018, 18:34 amin mohebbi
wrote:
> We do have two big tables each includes 5 billion of rows, so
-locality.
Thanks
Muthu
It is supported, with some limitations around JSR 376 (JPMS) that can cause
linker errors.
Thanks,
Muthu
On Sun, Apr 1, 2018 at 11:15 AM, kant kodali <kanth...@gmail.com> wrote:
> Hi Muthu,
>
> "On a side note, if some coming version of Scala 2.11 becomes full Java
> 9/10
version of Scala 2.11 becomes full Java 9/10
compliant, it could work.
Hope this helps.
Thanks,
Muthu
On Sun, Apr 1, 2018 at 6:57 AM, kant kodali <kanth...@gmail.com> wrote:
> Hi All,
>
> Does anybody got Spark running on Java 10?
>
> Thanks!
>
>
>
Are you getting the OutOfMemory on the driver or on the executor? A typical
cause of OOM in Spark is too few tasks for a job.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-OutOfMemory-Error-in-local-mode-tp29081p29117.html
Sent
I had a similar question in the past and worked around it by having my
spark-submit application register with my master application in order to
coordinate kill and/or progress of execution. This is a bit clunky, I
suppose, in comparison to the REST-like API available in the Spark
standalone cluster.
available to the SparkListener interface
that's available within every Spark application.
Please advise,
Muthu
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-API-tp29115.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
/api/scala/index.html#org.apache.spark.sql.Dataset).
This way I can change the numbers by the data.
Thanks,
Muthu
On Wed, Jul 19, 2017 at 8:23 AM, ayan guha <guha.a...@gmail.com> wrote:
> You can use spark.sql.shuffle.partitions to adjust amount of parallelism.
>
> On Wed, Jul 19, 2
itions'. In ideal situations, we have a long-running
application that uses the same spark-session and runs one or more
queries using FAIR mode.
Thanks,
Muthu
On Wed, Jul 19, 2017 at 6:03 AM, qihuagao [via Apache Spark User List] <
ml+s1001560n28879...@n3.nabble.com> wrote:
>
I may have a naive question on join / groupBy-agg. During the days of
RDD, whenever I wanted to perform (a) a groupBy-agg, I used to use reduceByKey
(of PairRDDFunctions) with an optional partition strategy (which is a number of
partitions or a Partitioner), or (b) a join (of PairRDDFunctions) and its
during a join
is to set 'spark.sql.shuffle.partitions' to some desired number during
spark-submit. I am trying to see if there is a way to provide this
programmatically for every step of a groupBy-agg / join.
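For illustration only, the idea behind those numPartitions parameters can be
sketched in plain Scala (the names here are mine, not Spark API):

```scala
// Sketch of hash partitioning: each key is assigned to one of
// numPartitions buckets. This is the knob that reduceByKey(numPartitions),
// a custom Partitioner, and 'spark.sql.shuffle.partitions' all control
// at shuffle time.
object HashPartitionSketch {
  // Which of the numPartitions buckets a key's records shuffle into.
  def bucketOf(key: String, numPartitions: Int): Int =
    math.floorMod(key.hashCode, numPartitions)

  // Group records by bucket, as a shuffle would before a groupBy-agg.
  def byBucket(pairs: Seq[(String, Int)], numPartitions: Int): Map[Int, Seq[(String, Int)]] =
    pairs.groupBy { case (k, _) => bucketOf(k, numPartitions) }
}
```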
Please advise,
Muthu
I run a spark-submit
(https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications)
in client-mode that starts the
micro-service. If you keep the event loop going then the spark context
would remain active.
Thanks,
Muthu
On Mon, Jun 5, 2017 at 2:44 PM, kant kodali
park in Spark
Standalone with a 32 node cluster.
Hope this gives some better idea.
Thanks,
Muthu
On Sun, Jun 4, 2017 at 10:33 PM, kant kodali <kanth...@gmail.com> wrote:
> Hi Muthu,
>
> I am actually using Play framework for my Micro service which uses Akka
> but I still don't unde
them to read Parquet and respond back results.
Hope this helps
Thanks
Muthu
On Mon, Jun 5, 2017, 01:01 Sandeep Nemuri <nhsande...@gmail.com> wrote:
> Well if you are using Hortonworks distribution there is Livy2 which is
> compatible with Spark2 and scala 2.11.
>
>
> https:/
astore
Please advise,
Muthu
alysis with
full table scans scenarios.
But I am thankful for many ideas and perspectives on how this could be
looked at.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 7:25 PM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
> Hi,
>
> The choice of ES vs Cassandra should really be made dependin
our thoughts.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote:
> Hi muthu,
>
> I agree with Shiva, Cassandra also supports SASI indexes, which can
> partially replace Elasticsearch functionality.
>
> Regards,
> Uladzimir
>
>
>
>
solution may be Spark to Kafka to ElasticSearch?
More thoughts welcome please.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <rsiebel...@gmail.com>
wrote:
> maybe Apache Ignite does fit your requirements
>
> On 15 March 2017 at 08:44, vincent gromakowski <
perform
simple filters and sorts using ElasticSearch, and for more complex aggregates,
Spark DataFrame can come back to the rescue :).
Please advise on other possible data stores I could use.
Thanks,
Muthu
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write
This worked. Thanks for the tip Michael.
Thanks,
Muthu
On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> The toString method of Dataset.queryExecution includes the various plans.
> I usually just log that directly.
>
> On Thu, Feb 16, 2017
).executedPlan.executeCollect().foreach {
    // scalastyle:off println
    r => println(r.getString(0))
    // scalastyle:on println
  }
}
sessionState is not accessible if I were to write my own explain(log:
LoggingAdapter).
Please advise,
Muthu
I guess this may help in your case:
https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view
Thanks,
Muthu
On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:
> Dear all,
>
> Here is a requiremen
Adding to Lars Albertsson & Miguel Morales: I am hoping to see how
well scalameta would branch out into support for macros that can do away
with sizable DI problems, and for the remainder, having a class type as args
as Miguel Morales mentioned.
Thanks,
On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales
have to only combine the previously combined result with the result from the
current time tn.
Please advise,
Muthu
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-tp28183.html
Sent from the Apache Spark User List mailing
Depending on your use case, 'df.withColumn("my_existing_or_new_col",
lit(0L))' could work?
On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren
wrote:
> Thanks for your answer. I have been searching the API for doing that
> but I could not find how to do it?
>
> Could you give
Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0.
Thanks,
Muthu
On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian <l...@databricks.com> wrote:
> Yea, confirmed. While analyzing unions, we treat StructTypes with
> different field nullabilities as incompatible type
.scala:161)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
Please advise,
Muthu
On Thu, Oct 20, 2016 at 1:46 AM, Michael Armb
on the simple schema of "col1
thru col4" above. But the problem seems to exist only on that
"some_histogram" column, which contains the mixed containsNull = true/false.
Let me know if this helps.
Thanks,
Muthu
On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.
ue)
 |    |-- freq: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
Is there a way to convert this attribute from true to false without running
any mapping / udf on that column?
Please advise,
Muthu
Hello Hao Ren,
Doesn't the code...
val add = udf {
  (a: Int) => a + notSer.value
}
...mean a UDF function of Int => Int?
Thanks,
Muthu
On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren <inv...@gmail.com> wrote:
> I am playing with spark 2.0
> What I tried to test is:
>
&g
.read.parquet(parquetFile).toJavaRDD.partitions.size()
res2: Int = 20
Could this be something to do with dynamic allocation, perhaps?
Please advise,
Muthu
On Sat, Aug 6, 2016 at 3:23 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
> 720 cores Wow. That is a hell of cores Muthu :)
>
> Ok le
On a side note, I do understand that 200 parquet part files for the above
2.2 G seems overkill for a 128 MB block size. Ideally it should be 18
parts or so.
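The arithmetic behind that estimate, as a quick check (assuming 2.2 GB of
output and one 128 MB block per part file):

```scala
// Expected number of parquet parts if each part were one HDFS block.
object PartCountCheck {
  val totalMB = 2.2 * 1024                              // ~2252.8 MB
  val blockMB = 128.0
  val expectedParts = math.ceil(totalMB / blockMB).toInt // 18
}
```

versus the 200 parts actually produced by the default
'spark.sql.shuffle.partitions'.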
Please advise,
Muthu
, but I don't
know how to split and map the row elegantly. Hence I am using it as an RDD.
Thanks,
Muthu
On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng <mengdong0...@gmail.com> wrote:
> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@
t[0, org.apache.spark.sql.Row, true], top
level row object)
+- input[0, org.apache.spark.sql.Row, true]
Let me know if you would like me try to create a more simplified reproducer
to this problem. Perhaps I should not be using Option[T] for nullable
schema values?
Please advise,
Muthu
Does increasing the number of partitions help? You could try something
like 3 times what you currently have.
Another trick I used was to partition the problem into multiple dataframes,
run them sequentially, persist the results, and then run a union on
the results.
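That "split, run sequentially, then union" trick can be sketched generically
in plain Scala (ChunkAndUnion and processChunk are my own names; processChunk
stands in for whatever per-dataframe work you do):

```scala
// Sketch only: split the problem, run the pieces one at a time, and union
// the (persisted) results. In Spark the chunks would be dataframes and
// the last step a Dataset union.
object ChunkAndUnion {
  def run[A, B](input: Seq[A], chunkSize: Int)(processChunk: Seq[A] => Seq[B]): Seq[B] =
    input.grouped(chunkSize)            // partition the problem
      .map(processChunk)                // run each piece on its own
      .foldLeft(Seq.empty[B])(_ ++ _)   // union of the results
}
```

Running the pieces sequentially trades total wall-clock time for a smaller
peak memory footprint per step.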
Hope this helps.
On Fri,
nt from my Verizon Wireless 4G LTE smartphone
>
>
> ---- Original message
> From: Muthu Jayakumar <bablo...@gmail.com>
> Date: 01/22/2016 3:50 PM (GMT-05:00)
> To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" <
> sande...@rose-hulman.edu
DataFrame and udf. This may be more performant than doing an RDD
transformation, as you'll transform only the column that needs to be
changed.
Hope this helps.
On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote:
> Hi
>
> I have a large size parquet file .
>
> I need
Thanks Michael. Let me test it with a recent master code branch.
Also, for every mapping step, do I have to create a new case class? I
cannot use a Tuple, as I have ~130 columns to process. Earlier I had used a
Seq[Any] (actually an Array[Any], to optimize serialization) but processed
it using RDD
>export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? If the
OutOfMemory is on the driver, then you may want to set it up explicitly
for the driver.
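As a sketch (the values and class name below are placeholders, not taken
from this thread):

```shell
# In conf/spark-env.sh on each worker (placeholder value):
export SPARK_WORKER_MEMORY=8g

# Or, if the OutOfMemory is on the driver side (placeholder value):
spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar
```

SPARK_WORKER_MEMORY caps what a standalone worker can hand out to executors,
while --driver-memory sizes the driver JVM's own heap.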
Thanks,
On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote:
> Hello,
>
>
mutableRow.setNullAt(0);
/* 144 */ } else {
/* 145 */
/* 146 */ mutableRow.update(0, primitive1);
/* 147 */ }
/* 148 */
/* 149 */ return mutableRow;
/* 150 */ }
/* 151 */ }
/* 152 */
Thanks.
On Tue, Jan 12, 2016 at 11:35 AM, Muthu Jayakumar <bablo...@gmail.com>
wrote:
> Tha
this may make it strongly typed?
Thank you for looking into my email.
Thanks,
Muthu
On Mon, Jan 11, 2016 at 3:08 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> Also, while extracting a value into Dataset using as[U] method, how could
>> I specify a custom encoder/translation
] method, how could I
specify a custom encoder/translation to a case class (where I don't have the
same column-name mapping or the same data-type mapping)?
Please advise,
Muthu
f.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO,
MemoryManager.DEFAULT_MEMORY_POOL_RATIO);).
I wonder how this is provided through Apache Spark. Meaning, I see that
'TaskAttemptContext' seems to be the hint for providing this, but I am not
able to find a way I could provide this configuration.
Please advise,
Muthu
On
t can use the default
serializer provided by Spark.
Hope this helps.
Thanks,
Muthu
On Sat, Nov 14, 2015 at 10:18 PM, Netai Biswas <mail2efo...@gmail.com>
wrote:
> Hi,
>
> Thanks for your response. I will give a try with akka also, if you have
> any sample code or useful link please
You could try to use an Akka actor system with Apache Spark, if you are
intending to use it in an online / interactive job execution scenario.
On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> You are probably trying to access the spring context from the
= +
logRecord);
}
Where do I create the new JavaRDD<String>? Do I do it before this loop? How
do I create this JavaRDD<String>?
In the loop I am able to get every record and I am able to print them.
I appreciate any help here.
Thanks,
Muthu
;
}
});
return null;
}
-Original Message-
From: Sundaram, Muthu X. [mailto:muthu.x.sundaram@sabre.com]
Sent: Tuesday, July 22, 2014 10:24 AM
To: user@spark.apache.org; d...@spark.incubator.apache.org
Subject: Tranforming flume events using
, July 11, 2014 1:43 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: Re: writing FLume data to HDFS
What is the error you are getting when you say "I was trying to write the
data to hdfs, but it fails"?
TD
On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X.
muthu.x.sundaram
I am new to Spark. I am trying to do the following:
Netcat -> Flume -> Spark Streaming (process Flume data) -> HDFS.
My flume config file has following set up.
Source = netcat
Sink = avrosink
Spark Streaming code:
I am able to print data from flume to the monitor. But I am struggling to
create a file.