Re: Best way to process this dataset

2018-06-18 Thread Georg Heiler
Use pandas or dask. If you do want to use Spark, store the dataset as Parquet/ORC and then continue to perform analytical queries on that dataset. Raymond Xie wrote on Tue., 19 June 2018 at 04:29: > I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my environment > is 20GB ssd
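A minimal sketch of that suggestion in Spark, assuming a headered CSV at a hypothetical path and a hypothetical behavior_type column: convert once to Parquet, then run the analytical queries against the columnar copy.

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    // One-time conversion: read the CSV, write it out as Parquet.
    // The path, header option, and column name are assumptions.
    spark.read
      .option("header", "true")
      .option("inferSchema", "true") // optional; one extra pass over the file
      .csv("/data/behavior.csv")
      .write
      .mode("overwrite")
      .parquet("/data/behavior.parquet")

    // Subsequent queries hit the compressed columnar copy, which is far
    // cheaper to scan than the raw CSV.
    val df = spark.read.parquet("/data/behavior.parquet")
    df.groupBy("behavior_type").count().show()

    spark.stop()
  }
}
```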

Best way to process this dataset

2018-06-18 Thread Raymond Xie
I have a 3.6GB csv dataset (4 columns, 100,150,807 rows); my environment is a 20GB SSD hard disk and 2GB of RAM. The dataset comes with: User ID: 987,994; Item ID: 4,162,024; Category ID: 9,439; Behavior type ('pv', 'buy', 'cart', 'fav'); Unix Timestamp: spanning November 25 to December 3, 2017. I

Re: convert array of values column to string column (containing serialised json) (SPARK-21513)

2018-06-18 Thread summersk
Resending with formatting hopefully fixed: Hello, SPARK-21513 proposes to support using the to_json UDF on any

convert array of values column to string column (containing serialised json) (SPARK-21513)

2018-06-18 Thread summersk
Hello, SPARK-21513 proposes to support using the to_json UDF on any column type; however, it fails with the
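For reference, the usual workaround in this Spark era: to_json accepts struct columns, so wrapping an otherwise unsupported column (e.g. an array of primitives) in struct() makes it serialisable. A sketch with hypothetical column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

val spark = SparkSession.builder().appName("to-json-workaround").getOrCreate()
import spark.implicits._

val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "tags")

// to_json($"tags") fails on a bare array of strings; wrapping it in a
// struct yields a JSON object like {"tags":["a","b"]} instead.
df.select($"id", to_json(struct($"tags")).as("tags_json")).show(truncate = false)
```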

Spark-Mongodb connector issue

2018-06-18 Thread ayan guha
Hi Guys, I have a large mongodb collection with a complex document structure. I am facing an issue where I get the error: Cannot cast Array to Struct. Value: BsonArray([]). The target column is indeed a struct, so the error makes sense. I am able to successfully read from another collection
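One way to sidestep sampling-based schema inference (where an empty BsonArray in sampled documents can make a struct field look like an array) is to supply an explicit schema to the reader. A sketch assuming the MongoDB Spark connector's DataFrame source, with hypothetical URI and field names; documents whose field genuinely is an empty array may still need cleaning upstream:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("mongo-explicit-schema").getOrCreate()

// Declare the struct column explicitly so the connector skips inference.
// Field names here are hypothetical stand-ins.
val addressSchema = StructType(Seq(
  StructField("street", StringType),
  StructField("city", StringType)))

val schema = StructType(Seq(
  StructField("_id", StringType),
  StructField("address", addressSchema))) // the struct column in question

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://host:27017/db.collection")
  .schema(schema)
  .load()
```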

Re: Spark batch job: failed to compile: java.lang.NullPointerException

2018-06-18 Thread ARAVIND SETHURATHNAM
Spark version is 2.2 and I think I am running into this issue: https://issues.apache.org/jira/browse/SPARK-18016 as the dataset schema is pretty huge and nested. From: ARAVIND SETHURATHNAM Date: Monday, June 18, 2018 at 4:00 PM To: "user@spark.apache.org" Subject: Spark batch job: failed to
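A commonly cited mitigation for SPARK-18016-style codegen failures on very wide or deeply nested schemas is to disable whole-stage code generation, trading some performance for a plan that compiles. A workaround sketch, not a fix for the underlying issue:

```scala
import org.apache.spark.sql.SparkSession

// Turn off whole-stage codegen so Spark falls back to interpreted
// expression evaluation instead of generating one huge Java class.
val spark = SparkSession.builder()
  .appName("wide-schema-job")
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()
```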

Repartition not working on a csv file

2018-06-18 Thread Abdeali Kothari
I am using Spark 2.3.0 and trying to read a CSV file which has 500 records. When I try to read it, Spark says that it has two stages, 10 and 11, and then they join into stage 12. This makes sense and is what I would expect, as I have 30 map-based UDFs after which I do a join, and run another 10 UDFs
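For context, repartition() only takes effect downstream of the point where it is called, and it introduces a shuffle stage of its own. A sketch with a hypothetical path and partition count:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-repartition").getOrCreate()

// A small single-split CSV reads into very few partitions; repartition
// right after the read shuffles the rows so later UDF stages fan out.
val df = spark.read
  .option("header", "true")
  .csv("/data/small.csv")
  .repartition(8)

println(df.rdd.getNumPartitions) // 8; stages after this point use 8 tasks
```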

load hbase data using spark

2018-06-18 Thread Lian Jiang
Hi, I am considering tools to load hbase data using spark. One choice is https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to be out of date (e.g., "This version of 1.0.0 requires Spark 1.4.0."). Which tool should I use for this purpose? Thanks for any hint.
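One connector-free alternative is to read HBase through the stock TableInputFormat via newAPIHadoopRDD. A sketch with hypothetical table, column family, and qualifier names:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-read").getOrCreate()
val sc = spark.sparkContext

// Point the standard HBase MapReduce input format at the table.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Pull one column out of each row (family "cf", qualifier "col").
val values = rdd.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
}
values.take(10).foreach(println)
```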

Re: Dataframe vs Dataset dilemma: either Row parsing or no filter push-down

2018-06-18 Thread Koert Kuipers
we use DataFrame and RDD. Dataset not only has issues with predicate pushdown, it also adds shuffles at times when it shouldn't, and there is some overhead from the encoders themselves, because under the hood it is still just Row objects. On Mon, Jun 18, 2018 at 5:00 PM, Valery Khamenya
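A sketch of the pushdown difference, using a hypothetical Parquet dataset with an amount column: the typed lambda is opaque to Catalyst, while the Column expression can be pushed into the scan.

```scala
import org.apache.spark.sql.SparkSession

case class Sale(id: Long, amount: Double)

val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/data/sales").as[Sale]

// Typed filter: the lambda is a black box to the optimizer, so Spark must
// deserialize every row before filtering -- no predicate pushdown.
val typed = ds.filter(_.amount > 100.0)

// Untyped Column filter: the predicate is a Catalyst expression and can be
// pushed into the Parquet scan.
val untyped = ds.filter($"amount" > 100.0)

// Compare the plans; only the second shows a PushedFilters entry.
typed.explain()
untyped.explain()
```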

Dataframe vs Dataset dilemma: either Row parsing or no filter push-down

2018-06-18 Thread Valery Khamenya
Hi Spark gurus, I was surprised to read here: https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap that filters are not pushed down in typed Datasets and one should rather stick to Dataframes. But writing code for

Re: Spark 2.4 release date

2018-06-18 Thread Jacek Laskowski
Hi, What about https://issues.apache.org/jira/projects/SPARK/versions/12342385? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams

Re: best practices to implement library of custom transformations of Dataframe/Dataset

2018-06-18 Thread Georg Heiler
I believe explicit is better than implicits; however, as you mentioned, the notation is very nice. Therefore, I suggest following https://medium.com/@mrpowers/chaining-custom-dataframe-transformations-in-spark-a39e315f903c and using df.transform(myFunction). Valery Khamenya wrote on Mon., 18 June 2018 at
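A sketch of the transform-chaining pattern from the linked article: each custom transformation is a plain DataFrame => DataFrame function, so the pipeline reads top to bottom without implicit conversions (function and column names follow the article's example):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("transform-chain").getOrCreate()
import spark.implicits._

// Each custom transformation is an ordinary function -- easy to test
// in isolation and to share as a library.
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

def withFarewell(df: DataFrame): DataFrame =
  df.withColumn("farewell", lit("goodbye"))

val df = Seq("alice", "bob").toDF("name")

// Dataset.transform threads the DataFrame through each function.
val result = df.transform(withGreeting).transform(withFarewell)
result.show()
```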

Spark 2.4 release date

2018-06-18 Thread Li Gao
Hello, do we have an estimate of when Spark 2.4 will be GA? We are evaluating whether to backport some of the 2.4 fixes into our 2.3 deployment. Thank you.

best practices to implement library of custom transformations of Dataframe/Dataset

2018-06-18 Thread Valery Khamenya
Dear Spark gurus, *Question*: what way would you recommend to shape a library of custom transformations for Dataframes/Datasets? *Details*: e.g., consider we need several custom transformations over the Dataset/Dataframe instances. For example injecting columns, apply specially tuned row

Zstd codec for writing dataframes

2018-06-18 Thread Nikhil Goyal
Hi guys, I was wondering if there is a way to compress output files using zstd. It seems zstd compression can be used for shuffle data; is there a way to use it for output data as well? Thanks, Nikhil
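A hedged sketch: Parquet gained a zstd codec in newer releases (Parquet 1.10+, surfaced in later Spark versions), so whether this works for output depends on the Parquet/Hadoop build in use; if the codec is missing it fails at write time. Paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zstd-out").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Per-write codec selection; requires a Parquet build that bundles zstd.
df.write.option("compression", "zstd").parquet("/data/out_zstd")

// ...or globally for all Parquet writes in the session.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
```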

Fwd: StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
*Correction: 60C2 * 3* -- Forwarded message -- From: Aakash Basu Date: Mon, Jun 18, 2018 at 4:15 PM Subject: StackOverFlow ERROR - Bulk interaction for many columns fail To: user Hi, When doing bulk interactions on around 60 columns, I want 3 columns to be created out of each

StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
Hi, When doing bulk interactions on around 60 columns, I want 3 columns to be created out of each one of them; since it has a combination of 3, it becomes 60N2 * 3, which creates a lot of columns. So, for fewer than 50-60 columns, even though it takes time, it still works fine, but, for
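Two mitigations commonly suggested for StackOverflowError when generating thousands of derived columns: build all expressions up front and apply them in a single select() rather than chained withColumn() calls (keeping the logical plan shallow), and checkpoint periodically to truncate lineage. A sketch with hypothetical interaction expressions:

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("bulk-columns").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

def addPairInteractions(df: DataFrame): DataFrame = {
  val cols = df.columns

  // 3 derived columns per pair (60C2 * 3 overall); the sum/diff/product
  // expressions are hypothetical stand-ins for the real interactions.
  val pairExprs: Seq[Column] = for {
    i <- cols.indices
    j <- (i + 1) until cols.length
    (expr, tag) <- Seq(
      (col(cols(i)) + col(cols(j)), "sum"),
      (col(cols(i)) - col(cols(j)), "diff"),
      (col(cols(i)) * col(cols(j)), "prod"))
  } yield expr.as(s"${cols(i)}_${cols(j)}_$tag")

  // One wide select instead of thousands of nested withColumn calls.
  df.select(cols.map(col) ++ pairExprs: _*)
}

// If the plan is still too deep, df.checkpoint() materializes it and
// resets the lineage before the next round of column generation.
```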

is spark stream-stream joins in update mode targeted for 2.4?

2018-06-18 Thread kant kodali
Hi All, Are Spark stream-stream joins in update mode targeted for 2.4? Thanks!