NPE in UDF yet no nulls in data because analyzer runs test with nulls

2017-04-14 Thread Koert Kuipers
we were running into an NPE in one of our UDFs for spark sql. now this particular function indeed could not handle nulls, but this was by design, since null input was never allowed (and we would want it to blow up if there was ever a null as input). we realized the issue was not in our data when we
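
A minimal PySpark sketch of the failure mode described (an assumption: the thread presumably concerns Scala/Java UDFs, where this surfaces as an NPE; the Python analogue is an AttributeError). If the engine ever evaluates the function with a null, the non-defensive variant blows up even though the data itself contains no nulls:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["s"])

# Non-defensive: raises if the engine ever evaluates it with a null input
unsafe_upper = udf(lambda s: s.upper(), StringType())

# Defensive: tolerates a null probe while doing the same work
safe_upper = udf(lambda s: s.upper() if s is not None else None, StringType())

df.select(safe_upper(df["s"]).alias("up")).show()
```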

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Thank you for your reply, I will open a pull request for this doc issue. The logic is clear. Fri, Apr 14, 2017, 23:34 Michael Armbrust: > 1) could we update documentation for Structured Streaming and describe >> that checkpointing could be specified by >>

Driver spins hours in query plan optimization

2017-04-14 Thread Everett Anderson
Hi, We keep hitting a situation on Spark 2.0.2 (haven't tested later versions yet) where the driver spins forever, seemingly in query plan optimization, for moderate queries such as the union of a few (~5) other DataFrames. We can see the driver spinning with one core in the
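
For context, a minimal sketch (hypothetical schema) of the kind of query described; Catalyst analysis and optimization happen entirely on the driver before any task launches:

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Five small DataFrames with the same schema, unioned pairwise
parts = [spark.range(1000).withColumnRenamed("id", "v") for _ in range(5)]
unioned = reduce(lambda a, b: a.union(b), parts)

# Nothing runs until an action; the driver does all plan optimization here
unioned.count()
```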

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Michael Armbrust
> 1) Could we update the documentation for Structured Streaming and describe that checkpointing can be specified via spark.sql.streaming.checkpointLocation at the SparkSession level, so that checkpoint dirs will be created automatically per query? Sure, please open a pull request.
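
A hedged sketch of that suggestion (paths and source are illustrative): set the conf once on the SparkSession, and every started query gets its own checkpoint directory underneath it.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")
         .getOrCreate())

# Illustrative file source; any streaming source works the same way
stream = spark.readStream.format("text").load("/tmp/stream-in")

# No per-query .option("checkpointLocation", ...) needed: a checkpoint
# dir is created under /tmp/checkpoints for each started query
query = stream.writeStream.queryName("q1").format("console").start()
```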

PySpark row_number Question

2017-04-14 Thread infa elance
Hi All, I am trying to understand how row_number is applied. In the code below, does Spark store the data in a DataFrame and then apply the row_number function, or does it apply it while reading from Hive? from pyspark.sql import HiveContext hiveContext = HiveContext(sc) hiveContext.sql(" ( SELECT colunm1
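
A hedged DataFrame-API sketch of the same question (table and column names are hypothetical). Spark evaluates lazily: the Hive scan and the row_number computation run together as one Spark job when an action fires; row_number is computed by Spark's window operator, not pushed into the Hive read itself.

```python
from pyspark.sql import HiveContext
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

hiveContext = HiveContext(sc)  # assumes an existing SparkContext `sc`

df = hiveContext.table("some_db.some_table")  # hypothetical table
w = Window.partitionBy("column1").orderBy("column2")

# Lazy until .show(): the scan and the window computation run in one job
df.withColumn("rn", row_number().over(w)).show()
```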

Re: Parameter in FlatMap function

2017-04-14 Thread Ankur Srivastava
You should instead broadcast your list and then use the broadcast variable in the flatMap function. Thanks Ankur On Fri, Apr 14, 2017 at 4:32 AM, Soheila S. wrote: > Hello all, > Can someone help me to solve the following fundamental problem? > > > I have a JavaRDD and as
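
The thread concerns the Java API; here is a PySpark sketch of the same pattern (list contents hypothetical): broadcast the list once and read it inside the flatMap closure rather than passing it through a constructor.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Broadcast the list once; every executor gets a read-only copy
keywords = sc.broadcast(["spark", "flink"])

lines = sc.parallelize(["spark streaming", "plain text", "flink job"])
hits = lines.flatMap(
    lambda line: [w for w in line.split() if w in keywords.value])
print(hits.collect())  # ['spark', 'flink']
```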

Spark Testing Library Discussion

2017-04-14 Thread Holden Karau
Hi Spark Users (+ Some Spark Testing Devs on BCC), A while back, on one of the many threads about testing in Spark, there was some interest in having a chat about the state of Spark testing and what people want/need. So if you are interested in joining an online (with maybe an IRL component if

Re: create column with map function apply to dataframe

2017-04-14 Thread Ankur Srivastava
If I understand your question, you should look at withColumn in the DataFrame API: df.withColumn("len", length(df["l"])), with length imported from pyspark.sql.functions. Thanks Ankur On Fri, Apr 14, 2017 at 6:07 AM, issues solution wrote: > Hi , > how can you create a column inside a map function > > > like that : > >
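
A runnable sketch of that suggestion, using the column name l from the thread: pyspark.sql.functions.length computes the per-row string length as a new column, with no need to drop down to an RDD map.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import length

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc",), ("de",)], ["l"])

# Adds a "len" column computed per row by the SQL engine
df.withColumn("len", length(df["l"])).show()
# +---+---+
# |  l|len|
# +---+---+
# |abc|  3|
# | de|  2|
# +---+---+
```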

Memory problems with simple ETL in Pyspark

2017-04-14 Thread Patrick McCarthy
Hello, I'm trying to build an ETL job which takes in 30-100 GB of text data and prepares it for SparkML. I don't speak Scala, so I've been trying to implement it in PySpark on YARN, Spark 2.1. Despite the transformations being fairly simple, the job always fails by running out of executor memory.

create column with map function apply to dataframe

2017-04-14 Thread issues solution
Hi, how can you create a column inside a map function, like this: df.map(lambda l: len(l)), but instead of returning an RDD, create the column inside the DataFrame?

Re: Spark API authentication

2017-04-14 Thread Saisai Shao
IIUC, an auth filter on the live UI REST API should already be supported; the fix in SPARK-19652 is mainly for the History Server UI, to support per-app ACLs. As for the application-submission REST API in standalone mode, I think it is currently not supported; that is expected rather than a bug. On Fri, Apr 14, 2017 at 6:56 PM,

Re: Yarn containers getting killed, error 52, multiple joins

2017-04-14 Thread Rick Moritz
Potentially, with joins, you run out of memory on a single executor because a small skew in your data is being amplified. You could try to increase the default number of partitions, reduce the number of simultaneous tasks in execution (spark.executor.cores), or add a repartitioning operation
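
A hedged sketch of those three remedies (table names and values are illustrative, not tuned):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "800")  # default is 200
         .config("spark.executor.cores", "2")  # fewer tasks sharing one heap
         .getOrCreate())

left = spark.table("left_tbl")    # hypothetical tables
right = spark.table("right_tbl")

# Explicit repartition on the join key before joining
joined = left.repartition(800, "join_key").join(right, "join_key")
```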

Re: checkpoint

2017-04-14 Thread Jean Georges Perrin
Sorry - can't help with PySpark, but here is some Java code which you may be able to translate to Python: http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ jg > On Apr 14, 2017, at 07:18, issues solution wrote: > > Hi > can someone give me a

Parameter in FlatMap function

2017-04-14 Thread Soheila S.
Hello all, Can someone help me solve the following fundamental problem? I have a JavaRDD, and as the flatMap argument I pass a new instance of a class which implements FlatMapFunction. This class has a constructor and a call method. In the constructor, I set the values for "List"

checkpoint

2017-04-14 Thread issues solution
Hi, can someone give me a complete example of working with checkpoint under PySpark 1.6? thx regards
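
A minimal sketch of what was asked for: RDD-level checkpointing, which is what PySpark 1.6 offers (DataFrame.checkpoint only arrived in Spark 2.1). Paths are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-demo")
sc.setCheckpointDir("/tmp/checkpoint-dir")  # HDFS path on a real cluster

rdd = sc.parallelize(range(100)).map(lambda x: x * x)
rdd.checkpoint()             # marks the RDD; materialized on first action
print(rdd.count())           # the action triggers the checkpoint write
print(rdd.isCheckpointed())  # True after the action
```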

Re: Spark API authentication

2017-04-14 Thread Sergey Grigorev
Thanks for the help! I've found a ticket with a similar problem: https://issues.apache.org/jira/browse/SPARK-19652. It looks like this fix did not make it into the 2.1.0 release. You said that for the second example a custom filter is not supported. Is that a bug or expected behavior? On 14.04.2017 13:22,

Re: Spark API authentication

2017-04-14 Thread Saisai Shao
AFAIK, for the first one a custom filter should work, but for the latter it is not supported. On Fri, Apr 14, 2017 at 6:17 PM, Sergey Grigorev wrote: > GET requests like http://worker:4040/api/v1/applications > or >

Re: Spark API authentication

2017-04-14 Thread Sergey Grigorev
GET requests like http://worker:4040/api/v1/applications or http://master:6066/v1/submissions/status/driver-20170414025324- return a successful result. But if I open the Spark master web UI, then it requests a username and password. On 14.04.2017 12:46, Saisai Shao wrote: Hi, What

Re: Spark API authentication

2017-04-14 Thread Saisai Shao
Hi, What specifically are you referring to by "Spark API endpoint"? A filter can only work with the Spark live and History web UIs. On Fri, Apr 14, 2017 at 5:18 PM, Sergey wrote: > Hello all, > > I've added my own spark.ui.filters to enable basic authentication for access to >

Spark API authentication

2017-04-14 Thread Sergey
Hello all, I've added my own spark.ui.filters to enable basic authentication for access to the Spark web UI. It works fine, but I can still make requests to the Spark API without any authentication. Is there any way to enable authentication for the API endpoints? P.S. Spark version is 2.1.0, deploy mode is
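
A hedged sketch of the setup described; com.example.BasicAuthFilter is a hypothetical servlet filter class, not a Spark built-in.

```python
from pyspark.sql import SparkSession

# The filter must be a javax.servlet.Filter on the driver's classpath.
# Filter parameters get their own config keys (spark.<filter class>.params
# or .param.<name>, depending on Spark version; check the security docs).
spark = (SparkSession.builder
         .config("spark.ui.filters", "com.example.BasicAuthFilter")
         .getOrCreate())
```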

SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Hello, guys. I have filed the ticket https://issues.apache.org/jira/browse/SPARK-20325. My case: I launch two streams from one source stream *streamToProcess* like this: streamToProcess .groupBy(metric) .agg(count(metric)) .writeStream
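
A sketch of that two-queries case (streamToProcess and metric come from the thread; the file source is an illustrative stand-in):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the thread's source stream
streamToProcess = (spark.readStream.format("text").load("/tmp/stream-in")
                   .withColumnRenamed("value", "metric"))

agg = streamToProcess.groupBy("metric").agg(count("metric"))

# Each .start() is an independent query; without the session-level
# spark.sql.streaming.checkpointLocation discussed above, each one
# needs its own checkpoint location
q1 = (agg.writeStream.outputMode("complete")
      .option("checkpointLocation", "/tmp/cp/q1")
      .format("console").start())
q2 = (agg.writeStream.outputMode("complete")
      .option("checkpointLocation", "/tmp/cp/q2")
      .format("memory").queryName("agg_counts").start())
```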