nd 2.
>> compared to Dataset's mapPartitions / map function?
>>
>> Enrico
>>
>>
>> Am 12.07.22 um 22:13 schrieb Muthu Jayakumar:
>>
>> Hello Enrico,
>>
>> Thanks for the reply. I found that I would have to use `mapPartitions`
>>
> |id-01f7pqqbxejt3ef4ap9qcs78m5|content of gs://bucket1/path/to/id-01g4he5cbqmdv7dnx46sebs0gt/file_result.json|id-2-01g4he5cbqmdv7dnx46sebs0gt|
> |id-01f7pqqbynh895ptpjjfxvk6dc|content of gs://bucket1/path/to/id-01g4he5cbx1kwhgvdme1s560dw/file_re
Hello there,
I have a dataframe with the following...
+---------+---------+---------------+
|entity_id|file_path|other_useful_id|
+---------+---------+---------------+
complete
> c3f1fbf102b7: Pull complete
> 262868e4544c: Pull complete
> 1c0fec43ba3f: Pull complete
> Digest:
> sha256:412c52d88d77ea078c50ed4cf8d8656d6448b1c92829128e1c6aab6687ce0998
> *Status: Downloaded newer image for openjdk:8-jre-slim*
> ---> 8f867fdbd02f
>
> What do you see on your side?
Hello there,
While using docker-image-tool (for Spark 3.1.1) it seems not to accept the
`java_image_tag` property. The docker image defaults to JRE 11. Here is what
I am running from the command line.
$ spark/bin/docker-image-tool.sh -r docker.io/sample-spark -b
java_image_tag=8-jre-slim -t 3.1.1 build
I suspect the spark job somehow has an incorrect (newer) version of
json4s in the classpath. json4s 3.5.3 is the highest version that can be
used.
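If the build uses sbt, one way to pin it could be a dependency override (a
sketch, assuming the json4s-jackson module; adjust to whichever json4s module
the build actually pulls in):

// build.sbt -- force json4s down to 3.5.3 (a hedged sketch, not a verified fix)
dependencyOverrides += "org.json4s" %% "json4s-jackson" % "3.5.3"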
Thanks,
Muthu
On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh
wrote:
> Hi,
>
> Spark version 2.4.3
> Hbase 1.2.7
>
> Data is stored in Hbase as Json
If you would require higher precision, you may have to write a custom udaf.
In my case, I ended up storing the data as a key-value ordered list of
histograms.
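If approximate results are acceptable, Spark's built-in approximate percentile
may be enough before reaching for a custom udaf; a minimal sketch, assuming a
DataFrame df with a numeric column named "value":

import org.apache.spark.sql.functions.expr
// 0.5 asks for the median; the last argument trades accuracy for memory
val median = df.agg(expr("percentile_approx(value, 0.5, 10000)"))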
Thanks
Muthu
On Mon, Nov 11, 2019, 20:46 Patrick McCarthy
wrote:
> Depending on your tolerance for error you could also use
> percentile
I am running a spark job with 20 cores, but I do not understand why my
application gets 1-2 cores on a couple of machines. Why does it not just run
on two nodes, like node1=16 cores and node2=4 cores? Instead the cores are
allocated like node1=2, node2=1, ..., node14=1.
I believe that's the intended default: the standalone scheduler spreads an
application's cores across as many workers as possible (spark.deploy.spreadOut
is true by default).
Hello there,
I have a basic question about how the number of tasks is determined per
spark job.
Let's limit the scope of this discussion to Parquet and Spark 2.x.
1. I thought that the number of tasks is proportional to the number of part
files that exist. Is this correct?
2. I noticed that for
Perhaps use of a generic StructType may work in your situation of being
language agnostic? Case classes are backed by implicits that provide type
conversions into the columnar format.
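A minimal sketch of the idea (the column names and rowRdd below are
hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// build the schema at runtime instead of relying on a case class
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))
val df = spark.createDataFrame(rowRdd, schema) // rowRdd: RDD[Row]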
My 2 cents.
Thanks,
Muthu
On Mon, Jan 7, 2019 at 4:13 AM yeikel valdes wrote:
>
>
> ===== Forwarded Message =====
The error reads as though the Preconditions.checkArgument() method is being
called with an incorrect parameter signature.
Could you check how many jars (besides the uber jar) actually
contain this method signature?
I smell a jar version conflict or something similar.
Thanks
Muthu
On Thu, Dec 20, 2018, 02:40 Mich T
The error means that you are missing commons-configuration-<version>.jar
from the classpath of the driver/worker.
Thanks,
Muthu
On Sat, Sep 29, 2018 at 11:55 PM yuvraj singh <19yuvrajsing...@gmail.com>
wrote:
> Hi, I am getting this error, please help me.
>
>
> 18/09/30 05:14:44 INFO Client:
>
A naive workaround may be to transform the json4s JValue to a String (using
something like compact()) and process it as a String. Once you are done with
the last action, you could convert it back to a JValue (using something like
parse())
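Roughly like this (a sketch, assuming plain json4s with the jackson backend,
and jv: JValue):

import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}
val asString: String = compact(render(jv)) // String is friendly to Spark's serializers
// ... run your transformations/actions over the String form ...
val back: JValue = parse(asString)         // rehydrate after the last action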
Thanks,
Muthu
On Wed, Sep 19, 2018 at 6:35 AM Arko Provo Mukherjee
I generally write to Parquet when I want to repeat the operation of reading
the data and performing different operations on it each time. This saves db
time for me.
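Something along these lines (the path is hypothetical):

// materialize once ...
df.write.mode("overwrite").parquet("hdfs://tmp/staged")
// ... then re-read cheaply for every follow-up computation
val staged = spark.read.parquet("hdfs://tmp/staged")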
Thanks
Muthu
On Thu, Jul 19, 2018, 18:34 amin mohebbi
wrote:
> We do have two big tables each includes 5 billion of rows, so my que
Hello there,
I am trying to check whether CDB is available for Apache Spark. I can
currently use CDB through the Postgres driver. But I would like to check
whether there are any specialized drivers that optimize for
predicate-push-down and other optimizations pertaining to data-local
>
> From the links, you pointed out. It looks like Scala 2.11.12 is compliant
> with Java 9/10?
>
> Thanks!
>
>
>
> On Sun, Apr 1, 2018 at 7:50 AM, Muthu Jayakumar
> wrote:
>
>> Short answer may be no. Spark runs on Scala 2.11. Even Scala 2.12 is also
>>
Short answer may be no. Spark runs on Scala 2.11, and even Scala 2.12 is
not fully Java 9 compliant. For more info...
http://docs.scala-lang.org/overviews/jdk-compatibility/overview.html ---
check the last section.
https://issues.apache.org/jira/browse/SPARK-14220
On a side note, if some coming v
The problem with 'spark.sql.shuffle.partitions' is that it needs to be set
before the spark session is created (I guess?). But ideally, I want to partition
by column during a join / group-by (something roughly like
repartitionBy(partitionExpression: Column*) from
https://spark.apache.org/docs/latest/ap
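What I am after is roughly this (the key column is hypothetical):

import org.apache.spark.sql.functions.col
// partition by column at the point of the join / group-by,
// instead of one global spark.sql.shuffle.partitions setting
val byKey = df.repartition(col("entity_id"))

(As an aside, spark.conf.set("spark.sql.shuffle.partitions", "400") does work
on a live session, since it is a runtime SQL conf.)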
Hello there,
I may have a naive question on join / groupBy-agg. During the days of
RDD, whenever I wanted to perform
a. groupBy-agg, I used to say reduceByKey (of PairRDDFunctions) with an
optional partition strategy (which is a number of partitions or a Partitioner)
b. join (of PairRDDFunctions)
wrote:
> Are you launching SparkSession from a MicroService or through spark-submit
> ?
>
> On Sun, Jun 4, 2017 at 11:52 PM, Muthu Jayakumar
> wrote:
>
>> Hello Kant,
>>
>> >I still don't understand How SparkSession can use Akka to communicate
>
Hello Kant,
>I still don't understand how SparkSession can use Akka to communicate with
SparkCluster?
Let me use your initial requirement as a way to illustrate what I mean,
i.e., "I want my Micro service app to be able to query and access data on
HDFS"
In order to run a query say a DF query (equ
One drastic suggestion could be to write a simple microservice using Akka and
create a SparkSession (during the start of vm) and pass it around. You can
look at SparkPI for sample source code to start writing your microservice.
In my case, I used akka http to wrap my business requests and transform
t
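A minimal sketch of the start-up piece (the app name and master URL are
hypothetical):

import org.apache.spark.sql.SparkSession
// create once when the service boots, then share across requests
val spark = SparkSession.builder()
  .appName("query-service")
  .master("spark://spark-master:7077")
  .getOrCreate()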
Hello there,
I am trying to understand the difference between the following
repartition() variants...
a. def repartition(partitionExprs: Column*): Dataset[T]
b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
c. def repartition(numPartitions: Int): Dataset[T]
My understanding is th
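In short, my working mental model as a sketch (df and the key column are
hypothetical):

import org.apache.spark.sql.functions.col
val a = df.repartition(col("key"))     // hash by expression, spark.sql.shuffle.partitions partitions
val b = df.repartition(10, col("key")) // hash by expression into exactly 10 partitions
val c = df.repartition(10)             // round-robin into exactly 10 partitions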
t; need a fast data store since you are doing batch-like processing (reading
> from Parquet files) and it is possible to control this part fully. And it
> also seems like you want to use ES. You can try to reduce the number of
> Spark executors to throttle the writes to ES.
>
>
s. Since you mention that your queries are going to be simple you can
> define your indexes in the materialized views according to how you want to
> query the data.
>
> Thanks,
> Shiva
>
>
>
> On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar
> wrote:
>
>>
Hello Vincent,
Cassandra may not fit my bill if I need to define my partition and other
indexes upfront. Is this right?
Hello Richard,
Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
then the connector to Apache Spark did not support Spark 2.0.
Another drastic thought ma
This worked. Thanks for the tip Michael.
Thanks,
Muthu
On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust
wrote:
> The toString method of Dataset.queryExecution includes the various plans.
> I usually just log that directly.
>
> On Thu, Feb 16, 2017 at 8:26 AM, Muthu Jayaku
Hello there,
I am trying to write a dataframe/dataset's queryExecution and/or its
logical plan to a log line. The current code...
def explain(extended: Boolean): Unit = {
val explain = ExplainCommand(queryExecution.logical, extended = extended)
sparkSession.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
r => println(r.getString(0))
}
}
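What Michael's tip boils down to, roughly (the logger is hypothetical):

// no need to invoke ExplainCommand by hand; the plans are in toString
logger.info(df.queryExecution.toString)              // all plans
logger.info(df.queryExecution.executedPlan.toString) // just the physical plan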
I guess, this may help in your case?
https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view
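i.e., something along these lines (the view name is hypothetical):

// register once; visible to every SparkSession in the same application
df.createGlobalTempView("shared_view")
// read it back from another session, under the global_temp database
spark.newSession().sql("SELECT * FROM global_temp.shared_view")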
Thanks,
Muthu
On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:
> Dear all,
>
> Here is a requirement I am thinking of implement
Adding to Lars Albertsson & Miguel Morales, I am hoping to see how
well scalameta's support for macros could do away with sizable DI problems,
and for the remainder, having a class type as args as Miguel
Morales mentioned.
Thanks,
On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales
Depending on your use case, 'df.withColumn("my_existing_or_new_col",
lit(0l))' could work?
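Spelled out as a sketch (the column name is yours to choose):

import org.apache.spark.sql.functions.lit
// adds the column if missing, or overwrites it with a constant 0L if present
val out = df.withColumn("my_existing_or_new_col", lit(0L))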
On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren
wrote:
> Thanks for your answer. I have been searching the API for doing that
> but I could not find how to do it?
>
> Could you give me a code snippet?
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
> at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>
> Please advise,
> Muthu
>
ote:
> What is the issue you see when unioning?
>
> On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar
> wrote:
>
>> Hello Michael,
>>
>> Thank you for looking into this query. In my case there seems to be an
>> issue when I union a parquet file read from disk
lity of the column?
>
> On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar
> wrote:
>
>> Hello there,
>>
>> I am trying to understand how and when does DataFrame (or Dataset) sets
>> nullable = true vs false on a schema.
>>
>> Here is my observation from a
Hello there,
I am trying to understand how and when DataFrame (or Dataset) sets
nullable = true vs false on a schema.
Here is my observation from a sample code I tried...
scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
2.0d))).toDF("col1", "col2", "col3").withColumn
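Here is a condensed version of that observation (primitive-typed columns come
out non-nullable, while String columns come out nullable):

val df = spark.createDataset(Seq((1, "a", 2.0d))).toDF("col1", "col2", "col3")
df.printSchema()
// root
//  |-- col1: integer (nullable = false)
//  |-- col2: string (nullable = true)
//  |-- col3: double (nullable = false)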
Hello Hao Ren,
Doesn't the code...
val add = udf {
(a: Int) => a + notSer.value
}
mean a UDF function of type Int => Int?
Thanks,
Muthu
On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren wrote:
> I am playing with spark 2.0
> What I tried to test is:
>
> Create a UDF in which there is a non serial
> On 6 August 2016 at 23:09, Muthu Jayakumar wrote:
>
>> Hello Dr Mich Talebzadeh,
>>
>> Tha
Hello there,
I am trying to understand how I could improve (or increase) the parallelism
of tasks that run for a particular spark job.
Here is my observation...
scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
25
> hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
200
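One workaround I am considering is an explicit repartition after the read (a
sketch; it does cost a shuffle):

val df = spark.read.parquet("hdfs://somefile").repartition(200)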
I don't
know how to split and map the row elegantly. Hence I am using it as an RDD.
Thanks,
Muthu
On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng wrote:
> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar
> wrote:
>
>> Hello there,
Hello there,
I am using Spark 2.0.0 to create a parquet file from a text file with
Scala. I am trying to read a text file with a bunch of values of type string
and long (mostly), and all of the occurrences can be null. In order to support
nulls, all the values are boxed with Option (ex:- Option[String
>
>
> ---- Original message
> From: Muthu Jayakumar
> Date: 01/22/2016 3:50 PM (GMT-05:00)
> To: Darren Govoni , "Sanders, Isaac B" <
> sande...@rose-hulman.edu>, Ted Yu
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler
Does increasing the number of partitions help? You could try something
like 3 times what you currently have.
Another trick I used was to partition the problem into multiple dataframes,
run them sequentially, persist the results, and then run a union on
the results.
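Roughly what I mean, as a sketch (parts and transform are hypothetical; the
split logic is up to you):

// parts: Seq[DataFrame], each a slice of the original problem
val results = parts.map { part =>
  val r = transform(part) // your per-slice job
  r.persist()
  r.count()               // force materialization before moving on
  r
}
val combined = results.reduce(_ union _)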
Hope this helps.
On Fri,
DataFrame and udf. This may be more performant than doing an RDD
transformation, as you'll transform only the column that needs to be
changed.
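For a plain type change you may not even need a udf; a cast sketch (the
column name is hypothetical):

import org.apache.spark.sql.functions.col
// rewrites just the one column, leaving the rest untouched
val out = df.withColumn("amount", col("amount").cast("double"))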
Hope this helps.
On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote:
> Hi
>
> I have a large size parquet file .
>
> I need to cast the whole colu
3 */ mutableRow.setNullAt(0);
/* 144 */ } else {
/* 145 */
/* 146 */ mutableRow.update(0, primitive1);
/* 147 */ }
/* 148 */
/* 149 */ return mutableRow;
/* 150 */ }
/* 151 */ }
/* 152 */
Thanks.
On Tue, Jan 12, 2016 at 11:35 AM, Muthu Jayakumar
wrote:
> Thanks Michael. Le
Thanks Michael. Let me test it with a recent master code branch.
Also, for every mapping step do I have to create a new case class? I
cannot use Tuple as I have ~130 columns to process. Earlier I had used a
Seq[Any] (actually Array[Any], to optimize serialization) but processed
it using RDD (
> export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? If the
OutOfMemory is for the driver, then you may want to set it explicitly
for the driver.
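Note that the driver heap has to be set before the driver JVM starts, e.g.
(a sketch; the class and jar names are placeholders):
$ spark-submit --driver-memory 4g --class your.Main your-app.jar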
Thanks,
On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote:
> Hello,
>
> I've a 5 nodes cluster whic
Hello Michael,
Thank you for the suggestion. This should do the trick for the column names.
But how could I transform a column's value type? Do I have to use a UDF? In
case I use a UDF, then the other question I have pertains to the
map step in Dataset, where I am running into an error when I
Hello there,
While looking at the features of Dataset, it seems to provide an alternative
to udf and udaf. Any documentation or sample code snippet on how to write
this would help in rewriting existing UDFs as Dataset mapping steps.
Also, while extracting a value into Dataset using as[U] m
Thanks Ewan Leith. This seems like a good start, as it seems to match up with
the symptoms I am seeing :).
But how do I specify "parquet.memory.pool.ratio"?
The Parquet code seems to take this parameter from
ParquetOutputFormat.getRecordWriter()
(ref code: float maxLoad = conf.getFloat(ParquetOutputFormat.M
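My best guess so far is to set it through the Hadoop configuration that
Parquet reads its settings from (a guess on my side; sc is the SparkContext):

// ParquetOutputFormat pulls its knobs from the task-side Hadoop conf
sc.hadoopConfiguration.setFloat("parquet.memory.pool.ratio", 0.3f)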
String t) throws Exception {
> springBean.someAPI(t); // here we will have db transaction as
> well.
> }
> });}
>
> Thanks,
> Netai
>
> On Sat, Nov 14, 2015 at 10:40 PM, Muthu Jayakumar
> wrote:
>
>> You could try to use akka actor system
You could try to use akka actor system with apache spark if you are
intending to use it in an online / interactive job execution scenario.
On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> You are probably trying to access the spring context from the execut