Re: [Structured Streaming] More than 1 streaming in a code

2018-04-13 Thread spark receiver
Hi Panagiotis, Wondering whether you solved the problem or not? Because I met the same issue today. I'd appreciate it so much if you could paste the code snippet if it's working. Thanks. > On 6 Apr 2018, at 7:40 AM, Aakash Basu wrote: > > Hi Panagiotis, > > I did that, but it still
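[For reference, the usual pattern for running more than one streaming query in a single application is to start every query and then block on spark.streams.awaitAnyTermination() rather than on the first query's awaitTermination(). A minimal sketch, using the built-in rate source and console sink purely for illustration:]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-streams").master("local[*]").getOrCreate()

val source = spark.readStream.format("rate").load() // built-in test source

// Start every query before blocking; calling q1.awaitTermination() here
// would park the main thread on a single query.
val q1 = source.writeStream
  .format("console")
  .start()

val q2 = source.groupBy("value").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Returns when any active query terminates; keeps all queries running.
spark.streams.awaitAnyTermination()
```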

avoid duplicate records when appending new data to a parquet

2018-04-13 Thread Lian Jiang
I have a parquet file with an id field that is the hash of the composite key fields. Is it possible to maintain the uniqueness of the id field when appending new data that may duplicate existing records in the parquet? Thanks!
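[Parquet itself cannot enforce a uniqueness constraint, so duplicates have to be filtered out before the append. One possible sketch, assuming a single writer and hypothetical paths/column names, is a left anti join of the new batch against the ids already on disk:]

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val path = "/data/events.parquet" // hypothetical location

def appendUnique(newData: DataFrame): Unit = {
  val existingIds = spark.read.parquet(path).select("id")
  // Drop in-batch duplicates, then drop rows whose id already exists on disk.
  val toAppend = newData
    .dropDuplicates("id")
    .join(existingIds, Seq("id"), "left_anti")
  toAppend.write.mode("append").parquet(path)
}
```

[Note this is not atomic: concurrent writers could still race duplicates in.]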

Re: Structured Streaming on Kubernetes

2018-04-13 Thread Anirudh Ramanathan
+ozzieba who was experimenting with streaming workloads recently. +1 to what Matt said. Checkpointing and driver recovery is future work. Structured streaming is important, and it would be good to get some production experiences here and try and target improving the feature's support on K8s for

Re: Spark parse fixed length file [Java]

2018-04-13 Thread Georg Heiler
I am not 100% sure whether Spark is smart enough to achieve this in a single pass over the data. If not, you could create a Java UDF for this that parses all the columns at once. Otherwise, you could enable Tungsten off-heap memory, which might speed things up. lsn24
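[As a sketch of that single-pass idea, built-in substring expressions (1-based offsets) can carve all columns out of the raw line in one projection, without a UDF; the offsets and field names below are invented for illustration:]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{substring, trim}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Dataset[String], one raw fixed-length record per line.
val raw = spark.read.textFile("hdfs:///data/records.bz2")

// All columns are cut out in one projection, i.e. one pass over the data.
val parsed = raw.select(
  trim(substring($"value", 1, 10)).alias("account"),
  trim(substring($"value", 11, 5)).cast("int").alias("amount"),
  trim(substring($"value", 16, 8)).alias("date")
)
```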

Re: Structured Streaming on Kubernetes

2018-04-13 Thread Matt Cheah
We don’t provide any Kubernetes-specific mechanisms for streaming, such as checkpointing to persistent volumes. But as long as streaming doesn’t require persisting to the executor’s local disk, streaming ought to work out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s local
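[In other words, the checkpoint only needs to live on storage that outlives the pod. A minimal sketch, where the HDFS URIs and the placeholder rate source are illustrative assumptions:]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val stream = spark.readStream.format("rate").load() // placeholder source

val query = stream.writeStream
  .format("parquet")
  .option("path", "hdfs://namenode:8020/output/events")
  // State and offsets go to HDFS, not the pod's local disk,
  // so they survive pod restarts.
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/events")
  .start()
```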

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Hi Gene - Are you saying that I just need to figure out how to get the Alluxio jar into the classpath of my parent application? If it shows up in the classpath then Spark will automatically know that it needs to use it when communicating with Alluxio? Apologies for going back-and-forth on

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Gene Pang
Hi Jason, Alluxio does work with Spark in master=local mode. This is because both spark-submit and spark-shell have command-line options to set the classpath for the JVM that is being started. If you are not using spark-submit or spark-shell, you will have to figure out how to configure that JVM

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Ok thanks - I was basing my design on this: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html Wherein it says: Once the SparkSession is instantiated, you can

Re: Live Stream Code Reviews :)

2018-04-13 Thread Holden Karau
Thank you :) Just a reminder this is going to start in under 20 minutes. If anyone has a PR they'd like reviewed please respond and I'll add it to the list (otherwise I'll stick to the normal list of folks who have opted in to live reviews). On Thu, Apr 12, 2018 at 2:08 PM, Gourav Sengupta

Re: Shuffling Data After Union and Write

2018-04-13 Thread Rahul Nandi
You can add a new column, say order, to each of the DFs, with values 1, 2 and 3 for df1, df2 and df3 respectively. Then you can sort the data based on that order column. On Fri 13 Apr, 2018, 21:56 SNEHASISH DUTTA, wrote: > Hi, > > I am currently facing an issue , while performing union
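[A sketch of that suggestion, with stand-in frames, assuming the three frames share a schema:]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-ins for the three frames in question.
val df1 = Seq("a", "b").toDF("value")
val df2 = Seq("c", "d").toDF("value")
val df3 = Seq("e", "f").toDF("value")

// Tag each frame with its intended position before the union.
val ordered = df1.withColumn("src_order", lit(1))
  .union(df2.withColumn("src_order", lit(2)))
  .union(df3.withColumn("src_order", lit(3)))
  .orderBy("src_order")
  .drop("src_order")

// coalesce(1) keeps the sorted order in a single output file; if the row
// order *within* each frame also matters, add a secondary index column
// (e.g. monotonically_increasing_id) before sorting.
ordered.coalesce(1).write.csv("/out/union_csv")
```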

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Marcelo Vanzin
There are two things you're doing wrong here: On Thu, Apr 12, 2018 at 6:32 PM, jb44 wrote: > Then I can add the alluxio client library like so: > sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT) First one, you can't modify JVM configuration after it
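[To make that concrete: in master=local mode the driver JVM is the application's own JVM, so extraClassPath set from inside the program comes too late. A hedged sketch of the working alternative, in which class names, paths and the Alluxio master address are assumptions: the client jar must already be on the classpath when the JVM starts, e.g. as a build dependency of the application.]

```scala
import org.apache.spark.sql.SparkSession

object LocalAlluxioApp {
  def main(args: Array[String]): Unit = {
    // Too late: the driver JVM is already running, so this has no effect here.
    // sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)

    val spark = SparkSession.builder()
      .master("local[*]")
      // Registering the filesystem implementation is fine at this point,
      // but the Alluxio client classes must already be on this JVM's
      // classpath (e.g. a compile-scope dependency of the application).
      .config("spark.hadoop.fs.alluxio.impl", "alluxio.hadoop.FileSystem")
      .getOrCreate()

    spark.read.text("alluxio://alluxio-master:19998/some/path").show()
  }
}
```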

Spark parse fixed length file [Java]

2018-04-13 Thread lsn24
Hello, We are running into issues while trying to process fixed-length files using Spark. The approach we took is as follows: 1. Read the .bz2 file into a dataset from HDFS using the spark.read().textFile() API. Create a temporary view. Dataset rawDataset =

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Thanks - I’ve seen this SO post; it covers spark-submit, which I am not using. Regarding the ALLUXIO_SPARK_CLIENT variable, it is located on the machine that runs the job which spawns the master=local Spark. According to the Spark documentation, this should be possible, but it appears it

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread yohann jardin
Hey Jason, Might be related to what is behind your variable ALLUXIO_SPARK_CLIENT and where the lib is located (is it on HDFS, on the node that submits the job, or local to all Spark workers?). There is a great post on SO about it: https://stackoverflow.com/a/37348234 We might as well check

Shuffling Data After Union and Write

2018-04-13 Thread SNEHASISH DUTTA
Hi, I am currently facing an issue while performing a union on three data frames, say df1, df2 and df3. Once the operation is performed and I try to save the data, the data gets shuffled, so the ordering of data in df1, df2 and df3 is not maintained. When I save the data as a text/csv file the

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
I do, and this is what I will fall back to if nobody has a better idea :) I was just hoping to get this working as it is much more convenient for my testing pipeline. Thanks again for the help > On Apr 13, 2018, at 11:33 AM, Geoff Von Allmen wrote: > > Ok - `LOCAL`

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Geoff Von Allmen
Ok - `LOCAL` makes sense now. Do you have the option to still use `spark-submit` in this scenario, but using the following options:

```bash
--master "local[*]" \
--deploy-mode "client" \
...
```

I know in the past, I have set up some options using `.config("Option", "value")` when creating the

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Hi Geoff - Appreciate the help here - I do understand what you’re saying below. And I am able to get this working when I submit a job to a local cluster. I think part of the issue here is that there’s ambiguity in the terminology. When I say “LOCAL” spark, I mean an instance of spark that is

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Geoff Von Allmen
I fought with a ClassNotFoundException for quite some time, but it was for kafka. The final configuration that got everything working was running spark-submit with the following options:

```bash
--jars "/path/to/.ivy2/jars/package.jar" \
--driver-class-path "/path/to/.ivy2/jars/package.jar" \
--conf
```

Task failure to read input files

2018-04-13 Thread Srikanth
Hello, I'm running Spark job on AWS EMR that reads many lzo files from a S3 bucket partitioned by date. Sometimes I see errors in logs similar to 18/04/13 11:53:52 WARN TaskSetManager: Lost task 151177.0 in stage 43.0 (TID 1516123, ip-10-10-2-6.ec2.internal, executor 57): java.io.IOException:

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread jb44
Haoyuan - As I mentioned below, I've been through the documentation already. It has not helped me to resolve the issue. Here is what I have tried so far:
- setting extraClassPath as explained below
- adding fs.alluxio.impl through sparkconf
- adding spark.sql.hive.metastore.sharedPrefixes

Re: Structured Streaming on Kubernetes

2018-04-13 Thread Tathagata Das
Structured streaming is stable in production! At Databricks, we and our customers collectively process almost 100s of billions of records per day using SS. However, we are not using kubernetes :) Though I don't think it will matter too much as long as kubes are correctly provisioned+configured

Passing Hive Context to FPGrowth.

2018-04-13 Thread Sbf xyz
Hi, I am using Apache Spark 2.2 and the mllib library in Python. I have to pass a Hive context to the FPGrowth algorithm. For that, I converted a DF to an RDD. I am struggling with some pickling errors. After going through Stack Overflow, it seems we need to convert an RDD to a pipelineRDD. Could anyone
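[If moving off the RDD-based mllib API is an option, the DataFrame-based FPGrowth in spark.ml (available since 2.2, in pyspark as well) avoids the RDD/pickling detour entirely; a minimal sketch, shown here in Scala with made-up transactions:]

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Made-up transactions; each row is a basket of items.
val transactions = Seq(
  (0, Array("bread", "milk")),
  (1, Array("bread", "butter", "milk")),
  (2, Array("butter"))
).toDF("id", "items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(transactions)

model.freqItemsets.show()
model.associationRules.show()
```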

Transforming json string in structured streaming problem

2018-04-13 Thread Junfeng Chen
Hi all, I need to read some string data in JSON format from Kafka, convert it to a dataframe and finally write it to a parquet file. But I am running into some problems. spark.readStream().json() only supports JSON files at a specified location; it cannot accept a Dataset the way spark.read.json can. I found
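[The usual pattern here is to read the Kafka value as a string column and parse it with from_json against a declared schema, rather than via spark.readStream().json(); a sketch where the broker, topic, schema and paths are all hypothetical:]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// The schema must be declared up front; streaming cannot infer it per batch.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")   // raw bytes -> JSON string
  .select(from_json($"json", schema).as("data")) // JSON string -> struct
  .select("data.*")                              // flatten into columns

parsed.writeStream
  .format("parquet")
  .option("path", "/output/events")
  .option("checkpointLocation", "/output/checkpoints/events")
  .start()
```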

Structured Streaming on Kubernetes

2018-04-13 Thread Krishna Kalyan
Hello All, We were evaluating Spark Structured Streaming on Kubernetes (running on GCP). It would be awesome if the Spark community could share their experience around this. I would like to know more about your production experience and the monitoring tools you are using. Since spark on kubernetes