We are happy to announce the availability of Spark 2.4.1!
Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
maintenance branch of Spark. We strongly recommend that all 2.4.0 users
upgrade to this stable release.
In Apache Spark 2.4.1, Scala 2.12 support is GA, and
oops, sorry for the confusion. I meant "20% of the size of your input data
set" allocated to Alluxio as the memory resource as the starting point.
After that, you can check the cache hit ratio for the Alluxio space based
on the metrics collected in the Alluxio web UI.
Hi Andy,
It really depends on your workloads. I would suggest allocating 20% of the
size of your input data set
as the starting point and see how it works.
Also, depending on the data source serving as the under store of Alluxio: if it
is remote (e.g., cloud storage like S3 or GCS),
you can perhaps use
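The 20% rule of thumb from this thread, written out as arithmetic (purely illustrative; the function name and fraction default are just this thread's suggestion, not an Alluxio API):

```python
def suggested_alluxio_mem_gb(input_dataset_gb, fraction=0.2):
    """Starting-point sizing from this thread: allocate ~20% of the
    input data set size to Alluxio memory, then tune up or down based
    on the cache hit ratio shown in the Alluxio web UI."""
    return input_dataset_gb * fraction

print(suggested_alluxio_mem_gb(100))  # 100 GB input -> 20.0 GB to start
```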
Hi,
I am new to Spark and NoSQL databases,
so please correct me if I am wrong.
Since I will be accessing multiple columns (almost 20-30 columns) of a row,
I will have to go with a row-based DB instead of a column-based one, right?
Maybe I can use Avro in this case. Does Spark go well with Avro? I will
The dataset is in a fairly popular data format called the libsvm data format,
popularized by the libsvm library.
http://svmlight.joachims.org - The 'How to Use' section describes the file
format.
XGBoost uses the same file format and their documentation is here -
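As a quick illustration of the format described above (plain Python, not part of Spark or XGBoost): each line is a label followed by sparse `index:value` pairs.

```python
def parse_libsvm_line(line):
    """Parse one line of the sparse libsvm format:
    '<label> <index>:<value> <index>:<value> ...'"""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        idx, val = token.split(":")
        features[int(idx)] = float(val)
    return label, features

print(parse_libsvm_line("1 3:0.5 10:2"))  # (1.0, {3: 0.5, 10: 2.0})
```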
Hi,
I am trying to use Apache Spark's decision tree classifier. I am
trying to implement the method found in the classification example at
https://spark.apache.org/docs/1.5.1/ml-decision-tree.html. I found the dataset at
https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt
Have you considered Presto with an Oracle connector?
From: Teemu Heikkilä
Date: Thursday, April 4, 2019 at 12:28 PM
To: Prasad Bhalerao
Cc: Jason Nerothin , user
Subject: Re: reporting use case
Based on your answers, I would consider using the update stream to update
actual snapshots, i.e. by joining the data.
Of course, how to get the data into Spark now depends on how the update
stream has been implemented.
Could you tell us a little bit more about that?
- Teemu
> On 4 Apr 2019, at 22.23,
Hi ,
I can create a view on these tables, but the thing is I am going to need
almost every column from these tables, and I have faced issues with Oracle
views on such large tables involving joins. Somehow Oracle used to
choose a not-so-correct execution plan.
Can you please tell me how
Hi Prasad,
Could you create an Oracle-side view that captures only the relevant
records and then use the Spark JDBC connector to load the view into Spark?
On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao
wrote:
> Hi,
>
> I am exploring spark for my Reporting application.
> My use case is as
Hi,
I am exploring spark for my Reporting application.
My use case is as follows...
I have 4-5 Oracle tables which contain more than 1.5 billion rows. These
tables are updated very frequently every day. I don't have the choice to change
the database technology, so this data is going to remain in Oracle
Abdeali, Jason:
While submitting the Spark job I used num-executors 8, num-cores 8,
driver-memory 14g and executor-memory 14g; the total data processed was 5 GB,
with 100+ aggregations and 50+ different joins at various DataFrame levels.
So it is really hard to tell a specific number of
It's running in local mode. I've run it in PyCharm and JupyterLab. I've
restarted the kernel several times.
B.
From: Abdeali Kothari
Sent: Thursday, April 4, 2019 06:35
To: Adaryl Wakefield
Cc: user@spark.apache.org
Subject: Re: pickling a udf
The syntax looks right.
Are you still getting
My thinking is that if you run everything in one partition - say 12 GB -
then you don't experience the partitioning problem - one partition will
have all duplicates.
If that's not the case, there are other options, but they would probably
require a design change.
On Thu, Apr 4, 2019 at 8:46 AM Jason
Hi all,
I am trying to make our application check the Spark version before
attempting to submit a job, to ensure the user is on a new enough
version (in our case, 2.3.0 or later). I realize that there is a
--version argument to spark-shell, but that prints the version next to
some ASCII art so a
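One way around the ASCII art is to extract the first x.y.z token from the tool's output and compare it numerically; a minimal stdlib-only sketch (the regex and function name are illustrative, not a Spark API):

```python
import re

def spark_version_ok(version_output, minimum=(2, 3, 0)):
    """Pull the first x.y.z version number out of output such as
    `spark-submit --version` (which mixes the number with ASCII art)
    and compare it numerically against `minimum`."""
    m = re.search(r"\b(\d+)\.(\d+)\.(\d+)\b", version_output)
    if m is None:
        raise ValueError("no version number found in output")
    return tuple(int(g) for g in m.groups()) >= minimum

print(spark_version_ok("  /_/  version 2.4.1"))  # True
```

Comparing tuples of ints avoids the classic string-comparison trap where "2.10.0" sorts before "2.3.0".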
Have you tried something like this?
spark.conf.set("spark.sql.shuffle.partitions", "5")
On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote:
> Hi Sparkers,
>
> I noticed that in my spark application, the number of tasks in the first
> stage is equal to the number of files read by the
How much memory do you have per partition?
On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri
wrote:
> I will get the information and will share with you.
>
> On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari
> wrote:
>
>> How long does it take to do the window solution ? (Also mention how many
>>
I will get the information and will share with you.
On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari
wrote:
> How long does it take to do the window solution ? (Also mention how many
> executors was your spark application using on average during that time)
> I am not aware of anything that is
The syntax looks right.
Are you still getting the error when you open a new python session and run
this same code ?
Are you running on your laptop with spark local mode or are you running
this on a yarn based cluster ?
It does seem like something in your python session isn't getting serialized
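For background on why the session matters here: PySpark ships UDFs to workers with cloudpickle because the stdlib pickle serializes functions by reference (module + name), so interactively defined functions like lambdas have no importable name and fail to round-trip. A small stdlib-only illustration:

```python
import pickle

square = lambda x: x * x  # defined interactively, no importable name

try:
    pickle.dumps(square)  # stdlib pickle stores functions by qualified
    failed = False        # name; '<lambda>' cannot be looked up again
except (pickle.PicklingError, AttributeError):
    failed = True

print(failed)  # True: stdlib pickle rejects the lambda
```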
How long does it take to do the window solution ? (Also mention how many
executors was your spark application using on average during that time)
I am not aware of anything that is faster. When I ran it on my data (~8-9 GB),
I think it took less than 5 mins (don't remember the exact time)
On Thu, Apr 4,
Are we not supposed to be using UDFs anymore? I copied an example straight from
a book and I'm getting weird results, and I think it's because the book is using
a much older version of Spark. The code below is pretty straightforward, but
I'm getting an error nonetheless. I've been doing a
Dears,
I'm working on a project that should integrate Spark Streaming with Kafka
using Java.
Currently the official documentation is confusing; it's not clear whether
"spark-streaming-kafka-0-10" is safe to be used in a production environment
or not.
According to "Spark Streaming + Kafka
Thanks for the awesome clarification/explanation.
I have cases where update_time can be the same.
I am in need of suggestions: with very large data, around 5 GB, the
window-based solution I mentioned is taking a very long time.
Thanks again.
On Thu, Apr 4, 2019 at 12:11 PM Abdeali Kothari
So, the above code using min() worked fine for me in general, but there was
one corner case where it failed.
Which was when I have something like:
invoice_id=1, update_time=*2018-01-01 15:00:00.000*
invoice_id=1, update_time=*2018-01-01 15:00:00.000*
invoice_id=1, update_time=2018-02-03 14:00:00.000
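A plain-Python sketch of the latest-record-per-key logic being discussed (illustrative only, not PySpark; it assumes the timestamp strings compare correctly as strings, and on an exact tie keeps the first row seen, roughly what a final de-duplication step gives you):

```python
rows = [
    {"invoice_id": 1, "update_time": "2018-01-01 15:00:00.000", "amount": 10},
    {"invoice_id": 1, "update_time": "2018-01-01 15:00:00.000", "amount": 11},
    {"invoice_id": 1, "update_time": "2018-02-03 14:00:00.000", "amount": 12},
]

def latest_per_invoice(rows):
    """Keep the row with the greatest update_time per invoice_id;
    on an exact tie, keep the first row seen."""
    best = {}
    for r in rows:
        current = best.get(r["invoice_id"])
        if current is None or r["update_time"] > current["update_time"]:
            best[r["invoice_id"]] = r
    return list(best.values())

print(latest_per_invoice(rows))  # only the 2018-02-03 row survives
```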
Hello Abdeali, Thank you for your response.
Can you please explain this line to me: "And the dropDuplicates at the end
ensures records with two values for the same 'update_time' don't cause
issues."
Sorry, I didn't get it right away. :)
On Thu, Apr 4, 2019 at 10:41 AM Abdeali Kothari
wrote:
> I've faced