NPE in UDF yet no nulls in data because analyzer runs test with nulls

2017-04-14 Thread Koert Kuipers
we were running into an NPE in one of our UDFs for spark sql.

now this particular function indeed could not handle nulls, but this was by
design, since null input was never allowed (and we would want it to blow up
if it ever received null input).

we realized the issue was not in our data when we added filters for nulls
and the NPE still happened. then we also saw the NPE when just doing
dataframe.explain instead of running our job.

turns out the issue is in EliminateOuterJoin.canFilterOutNull, where a row
with all nulls is fed into the expression as a test. it's the line:
val v = boundE.eval(emptyRow)

so should we conclude from this that all udfs should always be prepared to
handle nulls?
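
If the answer is yes, a minimal null-tolerant UDF sketch in PySpark (the
function and types below are illustrative, not the UDF from the original job):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Return None instead of raising when the optimizer probes the expression with
# an all-null row, even if real data never contains nulls.
safe_len = udf(lambda s: None if s is None else len(s), IntegerType())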


Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Thank you for your reply, I will open a pull request for this doc issue. The
logic is clear.

Fri, Apr 14, 2017, 23:34 Michael Armbrust :

> 1)  could we update documentation for Structured Streaming and describe
>> that checkpointing could be specified by
>> spark.sql.streaming.checkpointLocation on SparkSession level and thus
>> automatically checkpoint dirs will be created per foreach query?
>>
>>
> Sure, please open a pull request.
>
>
>> 2) Do we really need to specify the checkpoint dir per query? What is the
>> reason for this? In the end we will be forced to write some checkpointDir name
>> generator, for example associating it with some particular named query, and so
>> on?
>>
>
> Every query needs to have a unique checkpoint as this is how we track what
> has been processed.  If we don't have this, we can't restart the query
> where it left off.  In your example, I would suggest including the metric
> name in the checkpoint location path.
>
-- 

Yours faithfully,

Kate Eri.


Driver spins hours in query plan optimization

2017-04-14 Thread Everett Anderson
Hi,

We keep hitting a situation on Spark 2.0.2 (haven't tested later versions,
yet) where the driver spins forever seemingly in query plan optimization
for moderate queries, such as the union of a few (~5) other DataFrames.

We can see the driver spinning with one core in the nioEventLoopGroup-2-2
thread in a deep trace like the attached.

Throwing in a MEMORY_AND_DISK persist() so the query plan is collapsed works
around this, but it's a little surprising how often we encounter the
problem, forcing us to work to manage persisting/unpersisting tables and
potentially suffering unnecessary disk I/O.
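
For reference, a minimal PySpark sketch of that persist() workaround (the
DataFrame names are placeholders, not from the original job):

from pyspark import StorageLevel

combined = df1.union(df2).union(df3).union(df4).union(df5)
combined.persist(StorageLevel.MEMORY_AND_DISK)
combined.count()   # materialize, so downstream plans start from the cached result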

I've looked through JIRA but don't see open issues about this -- I might've
just not found them.

Anyone else encounter this?
org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:228)
org.apache.spark.sql.catalyst.expressions.Cast.equals(Cast.scala:119)
scala.collection.IndexedSeqOptimized$class.sameElements(IndexedSeqOptimized.scala:167)
scala.collection.mutable.ArrayBuffer.sameElements(ArrayBuffer.scala:48)
scala.collection.GenSeqLike$class.equals(GenSeqLike.scala:475)
scala.collection.AbstractSeq.equals(Seq.scala:41)
org.apache.spark.sql.catalyst.expressions.Least.equals(conditionalExpressions.scala:290)
scala.collection.IndexedSeqOptimized$class.sameElements(IndexedSeqOptimized.scala:167)
scala.collection.mutable.ArrayBuffer.sameElements(ArrayBuffer.scala:48)
scala.collection.GenSeqLike$class.equals(GenSeqLike.scala:475)
scala.collection.AbstractSeq.equals(Seq.scala:41)
org.apache.spark.sql.catalyst.expressions.Greatest.equals(conditionalExpressions.scala:350)
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.equals(predicates.scala:522)
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151)
scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
scala.collection.mutable.HashSet.clone(HashSet.scala:83)
scala.collection.mutable.HashSet.clone(HashSet.scala:40)
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151)
scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
scala.collection.SetLike$class.$plus$plus(SetLike.scala:141)
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:50)
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:300)
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:297)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:297)
org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:55)
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:174)
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:174)
org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:55)
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:174)
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:174)

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Michael Armbrust
>
> 1)  could we update documentation for Structured Streaming and describe
> that checkpointing could be specified by 
> spark.sql.streaming.checkpointLocation
> on SparkSession level and thus automatically checkpoint dirs will be
> created per foreach query?
>
>
Sure, please open a pull request.


> 2) Do we really need to specify the checkpoint dir per query? What is the
> reason for this? In the end we will be forced to write some checkpointDir name
> generator, for example associating it with some particular named query, and so
> on?
>

Every query needs to have a unique checkpoint as this is how we track what
has been processed.  If we don't have this, we can't restart the query
where it left off.  In your example, I would suggest including the metric
name in the checkpoint location path.
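
A minimal PySpark sketch of that suggestion (the stream, sink, and names below
are assumptions for illustration, not from the original thread):

from pyspark.sql.functions import count

base = "/tmp/checkpoints"   # or the value of spark.sql.streaming.checkpointLocation

query = (stream_to_process
    .groupBy("metric")
    .agg(count("metric"))
    .writeStream
    .outputMode("complete")
    .queryName(metric_name)
    .option("checkpointLocation", "{}/{}".format(base, metric_name))
    .format("console")      # any sink; the per-query checkpoint path is the point here
    .start())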


PySpark row_number Question

2017-04-14 Thread infa elance
Hi All,
I am trying to understand how row_number is applied. In the code below, does spark
store the data in a dataframe and then perform the row_number function, or does it
apply it while reading from hive?

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("""
SELECT column1, column2, column3, ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
FROM tablename
""")

Appreciate any guidance.
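
One way to check for yourself (a sketch): Spark computes window functions such
as ROW_NUMBER itself after reading the rows, and the physical plan makes this
visible as a Window operator sitting above the Hive scan.

df = hiveContext.sql("""
    SELECT column1, column2, column3,
           ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
    FROM tablename
""")
df.explain(True)   # look for a Window node above the HiveTableScan
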
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Parameter in FlatMap function

2017-04-14 Thread Ankur Srivastava
You should instead broadcast your list and then use the broadcast variable
in the flatmap function.
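
For illustration, the same broadcast-then-flatMap pattern in PySpark (the
original question is Java, but the idea carries over; the names below are made
up):

lookup = ["a", "b", "c"]              # the list previously set in the constructor
bc_lookup = sc.broadcast(lookup)      # shipped once to every executor

def expand(record):
    # read the broadcast value inside the function instead of a captured field
    return [(record, item) for item in bc_lookup.value]

result = rdd.flatMap(expand)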

Thanks
Ankur

On Fri, Apr 14, 2017 at 4:32 AM, Soheila S.  wrote:

> Hello all,
> Can someone help me to solve the following fundamental problem?
>
>
> I have a JavaRDD and, as the flatMap function, I pass a new instance of a class
> which implements FlatMapFunction. This class has a constructor and a
> call method. In the constructor, I set the values of the "List" variables
> that I need in the call method.
>
> In a local run there is no problem and it works correctly, but when I run
> it on a cluster (using YARN) the values are not known and the variables are null
> (I get a NullPointerException).
>
> Any solution, idea or hint will be really appreciated,
>
>
> All the best,
> Soheila
>


Spark Testing Library Discussion

2017-04-14 Thread Holden Karau
Hi Spark Users (+ Some Spark Testing Devs on BCC),

Awhile back on one of the many threads about testing in Spark there was
some interest in having a chat about the state of Spark testing and what
people want/need.

So if you are interested in joining an online (with maybe an IRL component
if enough people are SF based) chat about Spark testing please fill out
this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn

I think reasonable topics of discussion could be:

1) What is the state of the different Spark testing libraries in the
different core (Scala, Python, R, Java) and extended languages (C#,
Javascript, etc.)?
2) How do we make these more easily discovered by users?
3) What are people looking for in their testing libraries that we are
missing? (can be functionality, documentation, etc.)
4) Are there any examples of well tested open source Spark projects and
where are they?

If you have other topics, that's awesome.

To clarify, this is about libraries and best practices for people testing their
Spark applications, and less about testing Spark's internals (although as
illustrated by some of the libraries there is some strong overlap in what
is required to make that work).

Cheers,

Holden :)

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: create column with map function apply to dataframe

2017-04-14 Thread Ankur Srivastava
If I understand your question correctly, you should look at withColumn in the
dataframe api.

from pyspark.sql.functions import length
df.withColumn("len", length(df["l"]))

Thanks
Ankur

On Fri, Apr 14, 2017 at 6:07 AM, issues solution 
wrote:

> Hi ,
>  how can you create a column inside a map function
>
>
> like that :
>
> df.map(lambda l: len(l))
>
> but instead of returning an rdd we create a column inside the data frame?
>
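
A self-contained PySpark sketch (the column name "l" comes from the thread; the
rest is illustrative), using a UDF when the logic is arbitrary Python rather
than a built-in function:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

len_udf = udf(lambda v: None if v is None else len(v), IntegerType())
df_with_len = df.withColumn("len", len_udf(df["l"]))   # new column, still a DataFrame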


Memory problems with simple ETL in Pyspark

2017-04-14 Thread Patrick McCarthy
Hello,

I'm trying to build an ETL job which takes in 30-100gb of text data and
prepares it for SparkML. I don't speak Scala so I've been trying to
implement in PySpark on YARN, Spark 2.1.

Despite the transformations being fairly simple, the job always fails by
running out of executor memory.

The input table is long (~6bn rows) but composed of three simple values:

#
all_data_long.printSchema()

root
|-- id: long (nullable = true)
|-- label: short (nullable = true)
|-- segment: string (nullable = true)

#

First I join it to a table of particular segments of interests and do an
aggregation,

#

audiences.printSchema()

root
 |-- entry: integer (nullable = true)
 |-- descr: string (nullable = true)


print("Num in adl: {}".format(str(all_data_long.count())))

aud_str = audiences.select(audiences['entry'].cast('string'),
audiences['descr'])

alldata_aud = all_data_long.join(aud_str,
all_data_long['segment']==aud_str['entry'],
'left_outer')

str_idx  = StringIndexer(inputCol='segment',outputCol='indexedSegs')

idx_df   = str_idx.fit(alldata_aud)
label_df =
idx_df.transform(alldata_aud).withColumnRenamed('label','label_val')

id_seg = (label_df
.filter(label_df.descr.isNotNull())
.groupBy('id')
.agg(collect_list('descr')))

id_seg.write.saveAsTable("hive.id_seg")

#

Then, I use that StringIndexer again on the first data frame to featurize
the segment ID

#

alldat_idx =
idx_df.transform(all_data_long).withColumnRenamed('label','label_val')

#


My ultimate goal is to make a SparseVector, so I group the indexed segments
by id and try to cast it into a vector

#

list_to_sparse_udf = udf(lambda l, maxlen: Vectors.sparse(maxlen, {v:1.0
for v in l}),VectorUDT())

alldat_idx.cache()

feature_vec_len = (alldat_idx.select(max('indexedSegs')).first()[0] + 1)

print("alldat_dix: {}".format(str(alldat_idx.count())))

feature_df = (alldat_idx
.withColumn('label',alldat_idx['label_val'].cast('double'))
.groupBy('id','label')

.agg(sort_array(collect_list('indexedSegs')).alias('collect_list_is'))
.withColumn('num_feat',lit(feature_vec_len))

.withColumn('features',list_to_sparse_udf('collect_list_is','num_feat'))
.drop('collect_list_is')
.drop('num_feat'))

feature_df.cache()
print("Num in featuredf: {}".format(str(feature_df.count())))  ## <-
failure occurs here

#

Here, however, I always run out of memory on the executors (I've twiddled
driver and executor memory to check) and YARN kills off my containers. I've
gone as high as --executor-memory 15g but it still doesn't help.

Given the number of segments is at most 50,000 I'm surprised that a
smallish row-wise operation is enough to blow up the process.


Is it really the UDF that's killing me? Do I have to rewrite it in Scala?
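
One quick diagnostic (a sketch, not from the thread): check whether a handful of
ids collect very large lists, since a single huge collect_list row can blow up
an executor regardless of the UDF.

(alldat_idx
    .groupBy('id')
    .count()
    .orderBy('count', ascending=False)
    .show(20))   # the top counts approximate the largest per-id feature lists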





Query plans for the failing stage:

#


== Parsed Logical Plan ==
Aggregate [count(1) AS count#265L]
+- Project [id#0L, label#183, features#208]
   +- Project [id#0L, label#183, num_feat#202, features#208]
  +- Project [id#0L, label#183, collect_list_is#197, num_feat#202,
(collect_list_is#197, num_feat#202) AS features#208]
 +- Project [id#0L, label#183, collect_list_is#197, 56845.0 AS
num_feat#202]
+- Aggregate [id#0L, label#183], [id#0L, label#183,
sort_array(collect_list(indexedSegs#93, 0, 0), true) AS collect_list_is#197]
   +- Project [id#0L, label_val#99, segment#2, indexedSegs#93,
cast(label_val#99 as double) AS label#183]
  +- Project [id#0L, label#1 AS label_val#99, segment#2,
indexedSegs#93]
 +- Project [id#0L, label#1, segment#2,
UDF(cast(segment#2 as string)) AS indexedSegs#93]
+- MetastoreRelation pmccarthy, all_data_long

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#265L]
+- Project [id#0L, label#183, features#208]
   +- Project [id#0L, label#183, num_feat#202, features#208]
  +- Project [id#0L, label#183, collect_list_is#197, num_feat#202,
(collect_list_is#197, num_feat#202) AS features#208]
 +- Project [id#0L, label#183, collect_list_is#197, 56845.0 AS
num_feat#202]
+- Aggregate [id#0L, label#183], [id#0L, label#183,
sort_array(collect_list(indexedSegs#93, 0, 0), true) AS collect_list_is#197]
   +- Project [id#0L, label_val#99, segment#2, indexedSegs#93,
cast(label_val#99 as 


create column with map function apply to dataframe

2017-04-14 Thread issues solution
Hi ,
 how you can create column inside map function


like that :

df.map(lambd l : len(l) ) .

but instead return rdd we create column insde data frame .


Re: Spark API authentication

2017-04-14 Thread Saisai Shao
IIUC, the auth filter on the Live UI REST API should already be supported; the
fix in SPARK-19652 is mainly for the History UI, to support per-app ACLs.

For the application submission REST API in standalone mode, I think it is
currently not supported; it is not a bug.

On Fri, Apr 14, 2017 at 6:56 PM, Sergey Grigorev 
wrote:

> Thanks for help!
>
> I've found the ticket with a similar problem https://issues.apache.org/
> jira/browse/SPARK-19652. It looks like this fix did not hit to 2.1.0
> release.
> You said that for the second example custom filter is not supported. It
> is a bug or expected behavior?
>
> On 14.04.2017 13:22, Saisai Shao wrote:
>
> AFAIK, For the first line, custom filter should be worked. But for the
> latter it is not supported.
>
> On Fri, Apr 14, 2017 at 6:17 PM, Sergey Grigorev 
> wrote:
>
>> GET requests like http://worker:4040/api/v1/applications or
>> http://master:6066/v1/submissions/status/driver-20170414025324- return
>> successful result. But if I open the spark master web ui then it requests
>> username and password.
>>
>>
>> On 14.04.2017 12:46, Saisai Shao wrote:
>>
>> Hi,
>>
>> What specifically are you referring to "Spark API endpoint"?
>>
>> Filter can only be worked with Spark Live and History web UI.
>>
>> On Fri, Apr 14, 2017 at 5:18 PM, Sergey < 
>> grigorev-...@yandex.ru> wrote:
>>
>>> Hello all,
>>>
>>> I've added own spark.ui.filters to enable basic authentication to access
>>> to
>>> Spark web UI. It works fine, but I still can do requests to spark API
>>> without any authentication.
>>> Is there any way to enable authentication for API endpoints?
>>>
>>> P.S. spark version is 2.1.0, deploy mode is standalone.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-AP
>>> I-authentication-tp28601.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: 
>>> user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>
>


Re: Yarn containers getting killed, error 52, multiple joins

2017-04-14 Thread Rick Moritz
Potentially, with joins, you run out of memory on a single executor,
because a small skew in your data is being amplified. You could try to
increase the default number of partitions, reduce the number of
simultaneous tasks in execution (spark.executor.cores), or add a
repartitioning operation before/after the join.
To debug, you could try reducing the number of executors available, so you
can more easily see which job/stage ends up going (b)oom.
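
As a concrete, illustrative starting point for those knobs (values are examples
to tune, not recommendations):

--executor-cores 2 \
--conf spark.sql.shuffle.partitions=800 \
--conf spark.default.parallelism=800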

On Fri, Apr 14, 2017 at 12:05 AM, Chen, Mingrui 
wrote:

> 1.5TB is incredibly high. It doesn't seem to be a configuration problem.
> Could you paste the code snippet doing the loop and join task on the
> dataset?
>
>
> Best regards,
>
> --
> *From:* rachmaninovquartet 
> *Sent:* Thursday, April 13, 2017 10:08:40 AM
> *To:* user@spark.apache.org
> *Subject:* Yarn containers getting killed, error 52, multiple joins
>
> Hi,
>
> I have a spark 1.6.2 app (tested previously in 2.0.0 as well). It is
> requiring a ton of memory (1.5TB) for a small dataset (~500mb). The memory
> usage seems to jump, when I loop through and inner join to make the dataset
> 12 times as wide. The app goes down during or after this loop, when I try
> to
> run a logistic regression on the generated dataframe. I'm using the scala
> API (2.10). Dynamic resource allocation is configured. Here are the
> parameters I'm using.
>
> --master yarn-client --queue analyst --executor-cores 5
> --executor-memory 40G --driver-memory 30G --conf spark.memory.fraction=0.75
> --conf spark.yarn.executor.memoryOverhead=5120
>
> Has anyone seen this or have an idea how to tune it? There is no way it
> should need so much memory.
>
> Thanks,
>
> Ian
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Yarn-containers-getting-killed-
> error-52-multiple-joins-tp28594.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: checkpoint

2017-04-14 Thread Jean Georges Perrin
Sorry - can't help with PySpark, but here is some Java code which you may be 
able to transform to Python? 
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/
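
For what it's worth, a minimal PySpark 1.6 sketch (the checkpoint directory is a
placeholder; use an HDFS path on a cluster):

sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.checkpoint()               # mark the RDD for checkpointing
rdd.count()                    # an action triggers the actual checkpoint write
print(rdd.isCheckpointed())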

jg


> On Apr 14, 2017, at 07:18, issues solution  wrote:
> 
> Hi 
> Can someone give me a complete example of working with checkpoints under Pyspark
> 1.6?
> 
> thx
> regards


Parameter in FlatMap function

2017-04-14 Thread Soheila S.
Hello all,
Can someone help me to solve the following fundamental problem?


I have a JavaRDD and, as the flatMap function, I pass a new instance of a class
which implements FlatMapFunction. This class has a constructor and a
call method. In the constructor, I set the values of the "List" variables
that I need in the call method.

In a local run there is no problem and it works correctly, but when I run it
on a cluster (using YARN) the values are not known and the variables are null (I
get a NullPointerException).

Any solution, idea or hint will be really appreciated,


All the best,
Soheila


checkpoint

2017-04-14 Thread issues solution
Hi
Can someone give me a complete example of working with checkpoints under Pyspark
1.6?

thx
regards


Re: Spark API authentication

2017-04-14 Thread Sergey Grigorev

Thanks for help!

I've found the ticket with a similar problem:
https://issues.apache.org/jira/browse/SPARK-19652. It looks like this
fix did not make it into the 2.1.0 release.
You said that for the second example the custom filter is not supported. Is
it a bug or expected behavior?


On 14.04.2017 13:22, Saisai Shao wrote:
AFAIK, For the first line, custom filter should be worked. But for the 
latter it is not supported.


On Fri, Apr 14, 2017 at 6:17 PM, Sergey Grigorev wrote:

GET requests like http://worker:4040/api/v1/applications or
http://master:6066/v1/submissions/status/driver-20170414025324- return
successful result. But if I open the spark master web ui
then it requests username and password.


On 14.04.2017 12:46, Saisai Shao wrote:

Hi,

What specifically are you referring to "Spark API endpoint"?

Filter can only be worked with Spark Live and History web UI.

On Fri, Apr 14, 2017 at 5:18 PM, Sergey wrote:

Hello all,

I've added own spark.ui.filters to enable basic
authentication to access to
Spark web UI. It works fine, but I still can do requests to
spark API
without any authentication.
Is there any way to enable authentication for API endpoints?

P.S. spark version is 2.1.0, deploy mode is standalone.



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-API-authentication-tp28601.html


Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org










Re: Spark API authentication

2017-04-14 Thread Saisai Shao
AFAIK, for the first one the custom filter should work. But for the
latter it is not supported.

On Fri, Apr 14, 2017 at 6:17 PM, Sergey Grigorev 
wrote:

> GET requests like http://worker:4040/api/v1/applications or
> http://master:6066/v1/submissions/status/driver-20170414025324- return
> successful result. But if I open the spark master web ui then it requests
> username and password.
>
>
> On 14.04.2017 12:46, Saisai Shao wrote:
>
> Hi,
>
> What specifically are you referring to "Spark API endpoint"?
>
> Filter can only be worked with Spark Live and History web UI.
>
> On Fri, Apr 14, 2017 at 5:18 PM, Sergey  wrote:
>
>> Hello all,
>>
>> I've added own spark.ui.filters to enable basic authentication to access
>> to
>> Spark web UI. It works fine, but I still can do requests to spark API
>> without any authentication.
>> Is there any way to enable authentication for API endpoints?
>>
>> P.S. spark version is 2.1.0, deploy mode is standalone.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Spark-API-authentication-tp28601.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>


Re: Spark API authentication

2017-04-14 Thread Sergey Grigorev
GET requests like http://worker:4040/api/v1/applications or
http://master:6066/v1/submissions/status/driver-20170414025324- return
successful result. But if I open the spark master web ui then it
requests username and password.


On 14.04.2017 12:46, Saisai Shao wrote:

Hi,

What specifically are you referring to "Spark API endpoint"?

Filter can only be worked with Spark Live and History web UI.

On Fri, Apr 14, 2017 at 5:18 PM, Sergey wrote:


Hello all,

I've added own spark.ui.filters to enable basic authentication to
access to
Spark web UI. It works fine, but I still can do requests to spark API
without any authentication.
Is there any way to enable authentication for API endpoints?

P.S. spark version is 2.1.0, deploy mode is standalone.



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-API-authentication-tp28601.html


Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org







Re: Spark API authentication

2017-04-14 Thread Saisai Shao
Hi,

What specifically are you referring to by "Spark API endpoint"?

The filter only works with the Spark Live and History web UIs.

On Fri, Apr 14, 2017 at 5:18 PM, Sergey  wrote:

> Hello all,
>
> I've added own spark.ui.filters to enable basic authentication to access to
> Spark web UI. It works fine, but I still can do requests to spark API
> without any authentication.
> Is there any way to enable authentication for API endpoints?
>
> P.S. spark version is 2.1.0, deploy mode is standalone.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-API-authentication-tp28601.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark API authentication

2017-04-14 Thread Sergey
Hello all,

I've added my own spark.ui.filters to enable basic authentication for access to
the Spark web UI. It works fine, but I can still make requests to the Spark API
without any authentication.
Is there any way to enable authentication for the API endpoints?

P.S. spark version is 2.1.0, deploy mode is standalone.
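
For context, a minimal sketch of the kind of spark.ui.filters setup being
described (the filter class is a placeholder for your own javax.servlet.Filter
implementation, and the parameter names are illustrative):

spark.ui.filters                                  com.example.BasicAuthFilter
spark.com.example.BasicAuthFilter.param.realm     spark
spark.com.example.BasicAuthFilter.param.users     admin:secret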



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-API-authentication-tp28601.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Hello, guys.

I have initiated the ticket
https://issues.apache.org/jira/browse/SPARK-20325 ,

My case was: I launch two streams from one source stream streamToProcess, like
this:

streamToProcess
  .groupBy(metric)
  .agg(count(metric))
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", checkpointDir)
  .foreach(kafkaWriter)
  .start()


After that I’ve got an exception:

Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as
another query with same id is already active. Perhaps you are attempting to
restart a query from checkpoint that is already active.


It is caused by the fact that StreamingQueryManager.scala gets the checkpoint dir
from the stream's configuration, and because my streams have equal
checkpointDirs, the second stream tries to recover instead of creating a
new one. For more details see the ticket: SPARK-20325


1) Could we update the documentation for Structured Streaming and describe
that the checkpoint location can be specified by
spark.sql.streaming.checkpointLocation at the SparkSession level, so that
checkpoint dirs are created automatically for each query?

2) Do we really need to specify the checkpoint dir per query? What is the
reason for this? In the end we will be forced to write some checkpointDir name
generator, for example associating it with some particular named query, and so
on?

-- 

Yours faithfully,

Kate Eri.