[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-04 Thread DB Tsai
+user list We are happy to announce the availability of Spark 2.4.1! Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4 maintenance branch of Spark. We strongly recommend that all 2.4.0 users upgrade to this stable release. In Apache Spark 2.4.1, Scala 2.12 support is GA, and

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
Oops, sorry for the confusion. I meant "20% of the size of your input data set" allocated to Alluxio as the memory resource as the starting point. After that, you can check the cache hit ratio for the Alluxio space based on the metrics collected in the Alluxio web UI.

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
Hi Andy, It really depends on your workloads. I would suggest allocating 20% of the size of your input data set as the starting point and seeing how it works. Also, depending on your data source as the under store of Alluxio, if it is remote (e.g., cloud storage like S3 or GCS), you can perhaps use
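
As a rough worked example of that rule of thumb (the numbers here are hypothetical): for a 500 GB input data set, you would start with about 100 GB of Alluxio worker memory across the cluster, then raise or lower it based on the cache hit ratio observed in the web UI.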

Re: reporting use case

2019-04-04 Thread Prasad Bhalerao
Hi, I am new to Spark and NoSQL databases, so please correct me if I am wrong. Since I will be accessing multiple columns (almost 20-30 columns) of a row, I will have to go with a row-based DB instead of a column-based one, right? Maybe I can use Avro in this case. Does Spark work well with Avro? I will
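
For what it's worth, a minimal PySpark sketch of reading Avro data (in Spark 2.4 Avro support lives in the external spark-avro module, added e.g. via --packages; the path below is a placeholder):

    # e.g. spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.1 ...
    df = spark.read.format("avro").load("/data/reporting/snapshot.avro")
    df.printSchema()
    df.show(5)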

Re: Qn about decision tree apache spark java

2019-04-04 Thread Abdeali Kothari
The dataset is in a fairly popular data format called the libsvm data format - popularized by the libsvm library. http://svmlight.joachims.org - the 'How to Use' section describes the file format. XGBoost uses the same file format and their documentation is here -
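
For a quick sense of the format: each line is "<label> <index>:<value> <index>:<value> ...", with indices typically 1-based and zero-valued features omitted. Two made-up example lines:

    1 3:0.75 10:1.0 42:0.5
    0 1:1.0 7:0.25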

Qn about decision tree apache spark java

2019-04-04 Thread Serena S Yuan
Hi, I am trying to use Apache Spark's decision tree classifier. I am trying to implement the method found in the classification example at https://spark.apache.org/docs/1.5.1/ml-decision-tree.html. I found the dataset at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt
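
A minimal sketch of that classification example using the DataFrame-based API in PySpark (the flow is the same in Java; this assumes the sample file linked above is available locally):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import DecisionTreeClassifier

    spark = SparkSession.builder.getOrCreate()
    # each row has a numeric "label" and a sparse "features" vector
    data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
    train, test = data.randomSplit([0.7, 0.3], seed=42)

    model = DecisionTreeClassifier(labelCol="label", featuresCol="features").fit(train)
    model.transform(test).select("label", "prediction").show(5)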

Re: Re: reporting use case

2019-04-04 Thread Hall, Steven
Have you considered Presto with an Oracle connector? From: Teemu Heikkilä Date: Thursday, April 4, 2019 at 12:28 PM To: Prasad Bhalerao Cc: Jason Nerothin, user Subject: Re: reporting use case Based on your answers, I would consider using the update stream to update actual snapshots ie. by

Re: reporting use case

2019-04-04 Thread Teemu Heikkilä
Based on your answers, I would consider using the update stream to update actual snapshots, i.e., by joining the data. Of course, how to get the data into Spark now depends on how the update stream has been implemented. Could you tell a little bit more about that? - Teemu > On 4 Apr 2019, at 22.23,

Re: reporting use case

2019-04-04 Thread Prasad Bhalerao
Hi, I can create a view on these tables, but the thing is I am going to need almost every column from these tables, and I have faced issues with Oracle views on such large tables which involve joins. Somehow Oracle used to choose a not-so-correct execution plan. Can you please tell me how

Re: reporting use case

2019-04-04 Thread Jason Nerothin
Hi Prasad, Could you create an Oracle-side view that captures only the relevant records and then use the Spark JDBC connector to load the view into Spark? On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao wrote: > Hi, > > I am exploring spark for my Reporting application. > My use case is as
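
A minimal PySpark sketch of that approach, assuming a view named REPORTING_VIEW and an Oracle JDBC driver on the classpath (the URL, credentials, and partition bounds are placeholders):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "REPORTING_VIEW")
          .option("user", "report_user")
          .option("password", "...")
          .option("partitionColumn", "ID")     # numeric column used to split the read
          .option("lowerBound", "1")
          .option("upperBound", "1500000000")
          .option("numPartitions", "16")
          .option("fetchsize", "10000")
          .load())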

reporting use case

2019-04-04 Thread Prasad Bhalerao
Hi, I am exploring Spark for my reporting application. My use case is as follows... I have 4-5 Oracle tables which contain more than 1.5 billion rows. These tables are updated very frequently every day. I don't have a choice to change the database technology. So this data is going to remain in Oracle

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
Abdeali, Jason: while submitting the Spark job with num-executors 8, num-cores 8, driver-memory 14g and executor-memory 14g, the total size of data processed was 5 GB, with 100+ aggregations and 50+ different joins at various DataFrame levels. So it is really hard to tell a specific number of
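
For reference, a sketch of the equivalent settings when building the session in PySpark (matching the spark-submit flags above; driver memory normally has to be given on the spark-submit command line, since it must be set before the driver JVM starts):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.instances", "8")
             .config("spark.executor.cores", "8")
             .config("spark.executor.memory", "14g")
             .getOrCreate())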

RE: pickling a udf

2019-04-04 Thread Adaryl Wakefield
It's running in local mode. I've run it in PyCharm and JupyterLab. I've restarted the kernel several times. B. From: Abdeali Kothari Sent: Thursday, April 4, 2019 06:35 To: Adaryl Wakefield Cc: user@spark.apache.org Subject: Re: pickling a udf The syntax looks right. Are you still getting

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
My thinking is that if you run everything in one partition - say 12 GB - then you don't experience the partitioning problem - one partition will have all duplicates. If that's not the case, there are other options, but they would probably require a design change. On Thu, Apr 4, 2019 at 8:46 AM Jason

Why does this spark-shell invocation get suspended due to tty output?

2019-04-04 Thread Jeff Evans
Hi all, I am trying to make our application check the Spark version before attempting to submit a job, to ensure the user is on a new enough version (in our case, 2.3.0 or later). I realize that there is a --version argument to spark-shell, but that prints the version next to some ASCII art so a
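
One alternative, if launching a small PySpark session is acceptable, is to read the version programmatically instead of parsing the spark-shell banner (a sketch; the threshold matches the 2.3.0 requirement above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    major, minor = (int(x) for x in spark.version.split(".")[:2])
    if (major, minor) < (2, 3):
        raise RuntimeError("Spark 2.3.0 or later is required, found " + spark.version)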

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-04 Thread Jason Nerothin
Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5") On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote: > Hi Sparkers, > > I noticed that in my spark application, the number of tasks in the first > stage is equal to the number of files read by the
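
One caveat worth noting (a sketch; the path is a placeholder): spark.sql.shuffle.partitions only controls the partition count after a shuffle, while the first stage's task count is driven by the input splits, so an explicit repartition may also be needed:

    spark.conf.set("spark.sql.shuffle.partitions", "5")  # post-shuffle partitions
    df = spark.read.parquet("/data/events")              # initial tasks ~ input file splits
    df = df.repartition(5)                               # force a smaller partition count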

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
How much memory do you have per partition? On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri wrote: > I will get the information and will share with you. > > On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari > wrote: > >> How long does it take to do the window solution ? (Also mention how many >>

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
I will get the information and share it with you. On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari wrote: > How long does it take to do the window solution ? (Also mention how many > executors was your spark application using on average during that time) > I am not aware of anything that is

Re: pickling a udf

2019-04-04 Thread Abdeali Kothari
The syntax looks right. Are you still getting the error when you open a new Python session and run this same code? Are you running on your laptop with Spark local mode, or are you running this on a YARN-based cluster? It does seem like something in your Python session isn't getting serialized

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Abdeali Kothari
How long does it take to do the window solution? (Also mention how many executors your Spark application was using on average during that time.) I am not aware of anything that is faster. When I ran it on my data (~8-9 GB), I think it took less than 5 mins (I don't remember the exact time). On Thu, Apr 4,

pickling a udf

2019-04-04 Thread Adaryl Wakefield
Are we not supposed to be using UDFs anymore? I copied an example straight from a book and I'm getting weird results, and I think it's because the book is using a much older version of Spark. The code below is pretty straightforward, but I'm getting an error nonetheless. I've been doing a
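
Since the code from the book is truncated here, a minimal self-contained UDF for comparison (pickling errors usually come from the function closing over unpicklable objects such as the SparkSession itself, so keeping it a plain top-level function helps):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    @udf(returnType=StringType())
    def shout(s):
        # plain Python function; no references to spark or other session objects
        return s.upper() if s is not None else None

    df.withColumn("name_upper", shout("name")).show()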

Why "spark-streaming-kafka-0-10" is still experimental?

2019-04-04 Thread Doaa Medhat
Dears, I'm working on a project that should integrate Spark Streaming with Kafka using Java. Currently the official documentation is confusing; it's not clear whether "spark-streaming-kafka-0-10" is safe to use in a production environment or not. According to "Spark Streaming + Kafka
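
The question is about the DStream-based module, but for comparison, here is a minimal sketch of the Structured Streaming Kafka source (the spark-sql-kafka-0-10 package), shown in PySpark with the broker address and topic as placeholders:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load())

    query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("console")
             .start())
    query.awaitTermination()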

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
Thanks for the awesome clarification/explanation. I have cases where update_time can be the same. I need suggestions: with fairly large data, around 5 GB, the window-based solution which I mentioned is taking a very long time. Thanks again. On Thu, Apr 4, 2019 at 12:11 PM Abdeali Kothari
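
For context, a sketch of a window-based dedup of the kind being discussed (column names follow the earlier messages in this thread; this keeps the latest row per invoice_id):

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("invoice_id").orderBy(F.col("update_time").desc())
    latest = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))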

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Abdeali Kothari
So, the above code for min() worked fine for me in general, but there was one corner case where it failed, which was when I had something like: invoice_id=1, update_time=2018-01-01 15:00:00.000; invoice_id=1, update_time=2018-01-01 15:00:00.000; invoice_id=1, update_time=2018-02-03 14:00:00.000
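
A sketch reconstructing the min()-based approach described here (earliest update_time per invoice_id, with the final dropDuplicates handling the tie case above):

    from pyspark.sql import functions as F

    earliest = df.groupBy("invoice_id").agg(F.min("update_time").alias("update_time"))
    deduped = (df.join(earliest, on=["invoice_id", "update_time"])
                 .dropDuplicates(["invoice_id", "update_time"]))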

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
Hello Abdeali, Thank you for your response. Can you please explain this line to me: "And the dropDuplicates at the end ensures records with two values for the same 'update_time' don't cause issues." Sorry, I didn't get it quickly. :) On Thu, Apr 4, 2019 at 10:41 AM Abdeali Kothari wrote: > I've faced