Re: share datasets across multiple spark-streaming applications for lookup

2017-11-02 Thread JG Perrin
Or Databricks Delta (announced at Spark Summit) or IBM Event Store, depending on the use case. On Oct 31, 2017, at 14:30, Joseph Pride wrote: Folks: SnappyData. I’m fairly new to working with it myself, but it looks pretty

Spark as ETL, was: Re: Dose pyspark supports python3.6?

2017-11-02 Thread JG Perrin
Pros: no need for Scala skills, Java can be used. Other companies are already doing it. > Support Yarn execution. But not only… Complex use cases for import can easily be done in Java (see https://spark-summit.org/eu-2017/events/extending-apache-sparks-ingestion-building-your-own-java-data-source/

Re: Is Spark suited for this use case?

2017-10-20 Thread JG Perrin
I have seen a similar scenario where we load data from an RDBMS into a NoSQL database… Spark made sense for velocity and parallel processing (and the cost of licenses :) ). > On Oct 15, 2017, at 21:29, Saravanan Thirumalai wrote: > > We are an Investment firm

Re: Java Rdd of String to dataframe

2017-10-20 Thread JG Perrin
SK, Have you considered: Dataset<Row> df = spark.read().json(dfWithStringRowsContainingJson); jg > On Oct 11, 2017, at 16:35, sk skk wrote: > > Can we create a dataframe from a Java pair RDD of String? I don’t have a > schema as it will be a dynamic JSON. I gave
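
A minimal, self-contained sketch of that suggestion (class name and sample rows are made up; in practice the strings would come from the values of the pair RDD):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JsonRddToDataFrame {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("json-rdd-to-df").master("local[*]").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Stand-in for the dynamic JSON coming out of the pair RDD's values.
        JavaRDD<String> jsonRdd = jsc.parallelize(Arrays.asList(
            "{\"id\":1,\"name\":\"a\"}", "{\"id\":2,\"name\":\"b\"}"));

        // Wrap the RDD as a Dataset<String>; spark.read().json() then infers the schema.
        Dataset<String> jsonDs = spark.createDataset(jsonRdd.rdd(), Encoders.STRING());
        Dataset<Row> df = spark.read().json(jsonDs);
        df.printSchema();
        df.show();
        spark.stop();
      }
    }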

RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-10 Thread JG Perrin
Something along the lines of: Dataset<Row> df = spark.read().json(jsonDf); ? From: kant kodali [mailto:kanth...@gmail.com] Sent: Saturday, October 07, 2017 2:31 AM To: user @spark Subject: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0? I
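
A hedged sketch of that one-liner, assuming the JSON rows are already available as a Dataset<String> (class name, field names, and sample rows are invented):

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JsonRowsToColumns {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("json-rows-to-columns").master("local[*]").getOrCreate();

        // Stand-in for the incoming array of JSON rows (one JSON object per element).
        Dataset<String> jsonDf = spark.createDataset(Arrays.asList(
            "{\"ticker\":\"GOOG\",\"price\":950.0,\"volume\":100}",
            "{\"ticker\":\"AAPL\",\"price\":155.0,\"volume\":200}"),
            Encoders.STRING());

        // The JSON reader infers the schema; then keep only the columns of interest.
        Dataset<Row> df = spark.read().json(jsonDf)
            .select("ticker", "price");
        df.show();
        spark.stop();
      }
    }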

RE: Spark 2.2.0 Win 7 64 bits Exception while deleting Spark temp dir

2017-10-03 Thread JG Perrin
Do you have a little more to share with us? Maybe you can set another TEMP directory. Are you getting a result? From: usa usa [mailto:usact2...@gmail.com] Sent: Tuesday, October 03, 2017 10:50 AM To: user@spark.apache.org Subject: Spark 2.2.0 Win 7 64 bits Exception while deleting Spark temp dir

RE: Quick one... AWS SDK version?

2017-10-03 Thread JG Perrin
Sorry Steve - I may not have been very clear: I am thinking about aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled with Spark. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, October 03, 2017 2:20 PM To: JG Perrin <jper...@lumeris.com> Cc

RE: Quick one... AWS SDK version?

2017-10-03 Thread JG Perrin
Thanks Yash… this is helpful! From: Yash Sharma [mailto:yash...@gmail.com] Sent: Tuesday, October 03, 2017 1:02 AM To: JG Perrin <jper...@lumeris.com>; user@spark.apache.org Subject: Re: Quick one... AWS SDK version? Hi JG, Here are my cluster configs if it helps. Cheers. EMR: emr

Quick one... AWS SDK version?

2017-10-02 Thread JG Perrin
Hey Sparkians, What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? Thanks! jg

RE: HDFS or NFS as a cache?

2017-10-02 Thread JG Perrin
[mailto:ste...@hortonworks.com] Sent: Saturday, September 30, 2017 6:10 AM To: JG Perrin <jper...@lumeris.com> Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org Subject: Re: HDFS or NFS as a cache? On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.

RE: Error - Spark reading from HDFS via dataframes - Java

2017-10-02 Thread JG Perrin
@Anastasios: just a word of caution, this is the Spark 1.x CSV parser; there are a few (minor) changes in Spark 2.x. You can have a look at http://jgp.net/2017/10/01/loading-csv-in-spark/. From: Anastasios Zouzias [mailto:zouz...@gmail.com] Sent: Sunday, October 01, 2017 2:05 AM To: Kanagha Kumar
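
For reference, a minimal Spark 2.x CSV load with the built-in parser (class name and file path are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvLoad {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("csv-load").master("local[*]").getOrCreate();

        // Spark 2.x ships a built-in CSV source; no external spark-csv package needed.
        Dataset<Row> df = spark.read()
            .option("header", "true")       // first line contains column names
            .option("inferSchema", "true")  // let Spark guess the column types
            .csv("data/books.csv");         // hypothetical path
        df.printSchema();
        df.show(5);
        spark.stop();
      }
    }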

RE: HDFS or NFS as a cache?

2017-09-29 Thread JG Perrin
You will collect in the driver (often on the master) and it will save the data, so for saving you will not have to set up HDFS. From: Alexander Czech [mailto:alexander.cz...@googlemail.com] Sent: Friday, September 29, 2017 8:15 AM To: user@spark.apache.org Subject: HDFS or NFS as a cache? I have

RE: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread JG Perrin
On a test system, you can also use something like ownCloud/Nextcloud/Dropbox to ensure that the files are synchronized. I would not do it for TBs of data ;) ... -Original Message- From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: Friday, September 29, 2017 5:14 AM To: Gaurav1809

RE: Loading objects only once

2017-09-28 Thread JG Perrin
Maybe copy the model to each executor’s disk and load it from there? Depending on how you use the data/model, something like Livy and sharing the same connection may help? From: Naveen Swamy [mailto:mnnav...@gmail.com] Sent: Wednesday, September 27, 2017 9:08 PM To: user@spark.apache.org

RE: More instances = slower Spark job

2017-09-28 Thread JG Perrin
As the others have mentioned, your loading time might kill your benchmark… I am in a similar process right now, but I time each operation: load, process 1, process 2, etc. It is not always easy with lazy operators, but you can try to force operations with a dummy collect and cache (for benchmarking
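
A rough sketch of that timing pattern, assuming a CSV input with a column named value (both hypothetical); the dummy count() after cache() forces each lazy stage so it can be timed separately:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TimedStages {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("timed-stages").master("local[*]").getOrCreate();

        long t0 = System.currentTimeMillis();
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .csv("data/input.csv");   // hypothetical input
        // cache() + count() forces the load so later timings do not include it.
        df.cache();
        df.count();
        long t1 = System.currentTimeMillis();
        System.out.println("load: " + (t1 - t0) + " ms");

        Dataset<Row> processed = df.filter("value is not null");  // stand-in for "process 1"
        processed.count();   // dummy action to force the transformation
        long t2 = System.currentTimeMillis();
        System.out.println("process 1: " + (t2 - t1) + " ms");
        spark.stop();
      }
    }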

RE: Debugging Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-26 Thread JG Perrin
Not using YARN, just a standalone cluster with 2 nodes here (physical, not even VMs). The network seems good between the nodes. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday, September 26, 2017 10:39 AM To: JG Perrin <jper...@lumeris.com> Cc: user@spark.apache.org Subject: Re: Deb

Debugging Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-26 Thread JG Perrin
Hi, I get the infamous: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources I run the app via Eclipse, connecting: SparkSession spark = SparkSession.builder() .appName("Converter -
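
For context, a minimal sketch of connecting from an IDE to a standalone master with explicit resource limits (the master URL, app name, and values are hypothetical, based on the truncated snippet above); asking for more memory or cores than the workers actually offer is a common cause of this warning:

    import org.apache.spark.sql.SparkSession;

    public class StandaloneConnect {
      public static void main(String[] args) {
        // Hypothetical master URL; the resource settings must fit what the
        // workers advertise, otherwise the "has not accepted any resources"
        // warning keeps repeating.
        SparkSession spark = SparkSession.builder()
            .appName("Converter")
            .master("spark://192.168.1.10:7077")
            .config("spark.executor.memory", "2g")
            .config("spark.cores.max", "4")
            .getOrCreate();

        System.out.println(spark.version());
        spark.stop();
      }
    }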

[Structured Streaming] Multiple sources best practice/recommendation

2017-09-13 Thread JG Perrin
Hi, I have different files being dumped on S3 and I want to ingest them and join them. What sounds better to you: one "directory" for all, or one per file format? If I have one directory for all, can I get some metadata about the file, like its name? If there are multiple directories, how can I

RE: CSV write to S3 failing silently with partial completion

2017-09-07 Thread JG Perrin
Are you assuming that all partitions are of equal size? Did you try with more partitions (like repartitioning)? Does the error always happen with the last (or smaller) file? If you are sending to Redshift, why not use the JDBC driver? -Original Message- From: abbim
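
As an illustration of the repartitioning idea only (class name, bucket, paths, and partition count are made up, and writing to s3a:// assumes the hadoop-aws bits are on the classpath):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RepartitionedCsvWrite {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("csv-to-s3").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.read()
            .option("header", "true")
            .csv("data/input.csv");            // hypothetical input

        // More, smaller partitions mean smaller individual uploads to S3,
        // which can make a partial failure easier to spot and retry.
        df.repartition(200)
          .write()
          .option("header", "true")
          .csv("s3a://my-bucket/output/");     // hypothetical bucket
        spark.stop();
      }
    }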

RE: Problem with CSV line break data in PySpark 2.1.0

2017-09-05 Thread JG Perrin
Have you tried the built-in parser, not the Databricks one (which is not really used anymore)? What does your original CSV look like? What does your code look like? There are quite a few options for reading a CSV… From: Aakash Basu [mailto:aakash.spark@gmail.com] Sent: Sunday, September 03,
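
A sketch of what the built-in reader's options look like for quoted fields that contain line breaks (class name and path are hypothetical; note that the multiLine option only exists from Spark 2.2 onwards, so it would not help on 2.1.0 as-is):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MultilineCsv {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("multiline-csv").master("local[*]").getOrCreate();

        // Embedded line breaks only survive if the field is quoted and,
        // from Spark 2.2 on, multiLine is enabled.
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("quote", "\"")
            .option("escape", "\"")
            .option("multiLine", "true")   // requires Spark 2.2+
            .csv("data/records.csv");      // hypothetical file
        df.show(5, false);
        spark.stop();
      }
    }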

RE: from_json()

2017-08-30 Thread JG Perrin
apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) From: JG Perrin [mailto:jper...@

RE: from_json()

2017-08-28 Thread JG Perrin
Thanks Sam – this might be the solution. I will investigate! From: Sam Elamin [mailto:hussam.ela...@gmail.com] Sent: Monday, August 28, 2017 1:14 PM To: JG Perrin <jper...@lumeris.com> Cc: user@spark.apache.org Subject: Re: from_json() Hi jg, Perhaps I am misunderstanding you, but if yo

from_json()

2017-08-28 Thread JG Perrin
Is there a way to avoid specifying a schema when using from_json(), or to infer the schema? When you read a JSON doc from disk, you can infer the schema. Should I write it to disk first (ouch)? jg
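
One possible way around it (a sketch with made-up field names): infer the schema from the JSON strings themselves and then feed that schema to from_json(), avoiding the round trip to disk:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.from_json;

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class InferThenFromJson {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("infer-from-json").master("local[*]").getOrCreate();

        // Stand-in for a dataframe holding the JSON document in a string column.
        Dataset<Row> raw = spark.createDataset(Arrays.asList(
            "{\"id\":1,\"name\":\"a\"}", "{\"id\":2,\"name\":\"b\"}"),
            Encoders.STRING()).toDF("value");

        // Infer the schema once from the strings themselves, no write to disk.
        StructType schema = spark.read()
            .json(raw.select("value").as(Encoders.STRING()))
            .schema();

        Dataset<Row> parsed = raw.withColumn("parsed", from_json(col("value"), schema));
        parsed.printSchema();
        parsed.show(false);
        spark.stop();
      }
    }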

RE: add me to email list

2017-08-28 Thread JG Perrin
Hey Mike, You need to do it yourself, it’s really easy: http://spark.apache.org/community.html. hih jg From: Michael Artz [mailto:michaelea...@gmail.com] Sent: Monday, August 28, 2017 7:43 AM To: user@spark.apache.org Subject: add me to email list Hi, Please add me to the email list Mike

RE: Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-24 Thread JG Perrin
Thanks Michael – this is a great article… very helpful From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, August 23, 2017 4:33 PM To: JG Perrin <jper...@lumeris.com> Cc: user@spark.apache.org Subject: Re: Joining 2 dataframes, getting result as nested list/str

Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-23 Thread JG Perrin
Hi folks, I am trying to join 2 dataframes, but I would like to have the result as a list of rows of the right dataframe (dDf in the example) in a column of the left dataframe (cDf in the example). I made it work with one column, but I am having issues adding more columns/creating a row(?). Seq
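
A small sketch of that shape using groupBy plus collect_list(struct(...)) after the join (class name, column names, and sample rows are invented):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.collect_list;
    import static org.apache.spark.sql.functions.struct;

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class NestedJoin {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("nested-join").master("local[*]").getOrCreate();

        // Hypothetical stand-ins for cDf (left) and dDf (right).
        Dataset<Row> cDf = spark.read().json(spark.createDataset(Arrays.asList(
            "{\"custId\":1,\"name\":\"Alice\"}",
            "{\"custId\":2,\"name\":\"Bob\"}"), Encoders.STRING()));
        Dataset<Row> dDf = spark.read().json(spark.createDataset(Arrays.asList(
            "{\"custId\":1,\"item\":\"book\",\"qty\":2}",
            "{\"custId\":1,\"item\":\"pen\",\"qty\":5}",
            "{\"custId\":2,\"item\":\"mug\",\"qty\":1}"), Encoders.STRING()));

        // Join, then collapse every matching right-hand row into a struct,
        // collected as one list column per row of the left dataframe.
        Dataset<Row> nested = cDf.join(dDf, "custId")
            .groupBy(col("custId"), col("name"))
            .agg(collect_list(struct(col("item"), col("qty"))).alias("details"));

        nested.printSchema();
        nested.show(false);
        spark.stop();
      }
    }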