Spark ML DAG Pipelines

2017-09-07 Thread Srikanth Sampath
Hi Spark Experts, Can someone point me to some examples of non-linear (DAG) ML pipelines? That would be of great help. Thanks much in advance -Srikanth

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Sunita Arvind
Thanks for your response, Michael. Will try it out. Regards Sunita On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust wrote: > If you use structured streaming and the file sink, you can have a > subsequent stream read using the file source. This will maintain exactly >

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Denis Magda
Hello Anjaneya, Marco, Honestly, I'm not aware whether video broadcasting or recording is planned. Could you go to the meetup page [1] and raise the question there? Just in case, here is where you can find a list of all upcoming Ignite-related events [2]. Probably some of them will be in close

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Marco Mistroni
Hi, will there be a podcast to view afterwards for remote EMEA users? Kr On Sep 7, 2017 12:15 AM, "Denis Magda" wrote: > Folks, > > Those who are craving for mind food this weekend come over the meetup - > Santa Clara, Sept 9, 9.30 AM: >

Spark Dataframe returning null columns when schema is specified

2017-09-07 Thread ravi6c2
Hi All, I have a problem where a Spark DataFrame ends up with null columns for attributes that are not present in the JSON. A clear explanation is provided below: *Use case:* Convert the JSON object into a dataframe for further usage. *Case - 1:* Without specifying the schema for JSON:

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Mcclintic, Abbi
Thanks all – a couple of notes below. Generally all our partitions are of equal size (i.e. on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with repartitioning and without – in this example we are repartitioning to 10 but we also see the

Spark UI to use Marathon assigned port

2017-09-07 Thread Sunil Kalyanpur
Hello all, I am running a PySpark job (v2.0.2) with checkpointing enabled in a Mesos cluster and am using Marathon for orchestration. When the job is restarted using Marathon, the Spark UI does not start at the port specified by Marathon; instead, it picks the port from the checkpoint. Is there a

Re: graphframe out of memory

2017-09-07 Thread Lukas Bradley
Did you also increase the size of the heap of the Java app that is starting Spark? https://alvinalexander.com/blog/post/java/java-xmx-xms-memory-heap-size-control On Thu, Sep 7, 2017 at 12:16 PM, Imran Rajjad wrote: > I am getting Out of Memory error while running

graphframe out of memory

2017-09-07 Thread Imran Rajjad
I am getting an Out of Memory error while running the connectedComponents job on a graph with around 12000 vertices and 134600 edges. I am running Spark in embedded mode in a standalone Java application and have tried to increase the memory, but it seems it's not taking any effect sparkConf = new

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Patrick Alwell
Sounds like an S3 bug. Can you replicate locally with HDFS? Try using S3a protocol too; there is a jar you can leverage like so: spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py EMR can sometimes be buggy. :/ You could also try

RE: CSV write to S3 failing silently with partial completion

2017-09-07 Thread JG Perrin
Are you assuming that all partitions are of equal size? Did you try with more partitions (like repartitioning)? Does the error always happen with the last (or smaller) file? If you are sending to Redshift, why not use the JDBC driver? -Original Message- From: abbim

Pyspark UDF causing ExecutorLostFailure

2017-09-07 Thread nicktgr15
Hi, I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as follows: from pyspark.sql.functions import col, udf from pyspark.sql.types import StringType path = 's3://some/parquet/dir/myfile.parquet' df = spark.read.load(path) def _test_udf(useragent): return

Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread ChenJun Zou
I examined the code and found that the lazy val was added recently, in 2.2.0 2017-09-07 14:34 GMT+08:00 ChenJun Zou : > thanks, > my mistake > > 2017-09-07 14:21 GMT+08:00 sujith chacko : > >> If your intention is to just view the logical plan in spark

Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread ChenJun Zou
thanks, my mistake 2017-09-07 14:21 GMT+08:00 sujith chacko : > If your intention is to just view the logical plan in spark shell then I > think you can follow the query which I mentioned in previous mail. In > spark 2.1.0 sessionState is a private member which you

Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread ChenJun Zou
I use spark-2.1.1. 2017-09-07 14:00 GMT+08:00 sujith chacko : > Hi, > may I know which version of spark you are using, in 2.2 I tried with > below query in spark-shell for viewing the logical plan and it's working > fine > > spark.sql("explain extended select *

CSV write to S3 failing silently with partial completion

2017-09-07 Thread abbim
Hi all, My team has been experiencing a recurring unpredictable bug where only a partial write to CSV in S3 on one partition of our Dataset is performed. For example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of the partitions as 2.8 GB in size, but one of them as 1.6 GB.