sandboxing spark executors

2016-11-03 Thread blazespinnaker
Is there a good method / discussion / documentation on how to sandbox a Spark
executor? Assume the code is untrusted and you don't want it to be able to
make unvalidated network connections or do unvalidated Alluxio/HDFS/file
I/O.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sanboxing-spark-executors-tp28014.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla,

There is documentation for setting up an IDE -
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup

I hope this is helpful.


2016-11-04 9:10 GMT+09:00 shyla deshpande :

> Hello Everyone,
>
> I just installed Spark 2.0.1, spark shell works fine.
>
> Was able to run some simple programs from the Spark Shell, but find it
> hard to make the same program work when using IntelliJ.
>  I am getting the following error.
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.$scope()Lscala/xml/TopScope$;
> at org.apache.spark.ui.jobs.StagePage.<init>(StagePage.scala:44)
> at org.apache.spark.ui.jobs.StagesTab.<init>(StagesTab.scala:34)
> at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:62)
> at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
> at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:440)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 8.apply(SparkSession.scala:831)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 8.apply(SparkSession.scala:823)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:823)
> at SparkSessionTest.DataSetWordCount$.main(DataSetWordCount.scala:15)
> at SparkSessionTest.DataSetWordCount.main(DataSetWordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
>
> Thanks
> -Shyla
>
>
>


Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not handling the namespaces (meaning the
data is left as it is, including prefixes).

The problem is that each partition needs to produce each record knowing
the namespaces.

That is fine to deal with when they are declared within each XML document
(represented as a row in the dataframe), but it becomes problematic when
they are declared in the parent of each XML document.

There is an issue open for this:
https://github.com/databricks/spark-xml/issues/74

It would be nicer to have an option to enable/disable this once we can
properly support namespace handling.

We might be able to talk more there.



2016-11-04 6:37 GMT+09:00 Arun Patel :

> I see that 'ignoring namespaces' issue is resolved.
>
> https://github.com/databricks/spark-xml/pull/75
>
> How do we enable this option and ignore namespace prefixes?
>
> - Arun
>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
Thinking out loud is good :)

You are right in that anytime you ask for a global ordering from Spark you
will pay the cost of figuring out the range boundaries for partitions.  If
you say orderBy, though, we aren't sure that you aren't expecting a global
order.

If you only want to make sure that items are colocated, it is cheaper to do
a groupByKey followed by a flatMapGroups.
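A minimal sketch of the groupByKey + flatMapGroups route (assuming Spark 2.x; the Txn case class and field names are made up to match the thread's example of picking the latest row per key):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Txn(mobileno: String, transactionDate: Long, customername: String)

def latestPerKey(spark: SparkSession, txns: Dataset[Txn]): Dataset[Txn] = {
  import spark.implicits._
  txns
    .groupByKey(_.mobileno)                  // colocates rows per key, no global sort
    .flatMapGroups { (key, rows) =>
      // rows for a single key arrive as an iterator; keep only the latest one
      Iterator(rows.maxBy(_.transactionDate))
    }
}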



On Thu, Nov 3, 2016 at 7:31 PM, Koert Kuipers  wrote:

> i guess i could sort by (hashcode(key), key, secondarySortColumn) and then
> do mapPartitions?
>
> sorry thinking out loud a bit here. ok i think that could work. thanks
>
> On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers  wrote:
>
>> thats an interesting thought about orderBy and mapPartitions. i guess i
>> could emulate a groupBy with secondary sort using those two. however isn't
>> using an orderBy expensive since it is a total sort? i mean a groupBy with
>> secondary sort is also a total sort under the hood, but its on
>> (hashCode(key), secondarySortColumn) which is easier to distribute and
>> therefore can be implemented more efficiently.
>>
>>
>>
>>
>>
>> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust 
>> wrote:
>>
>>> It is still unclear to me why we should remember all these tricks (or
 add lots of extra little functions) when this elegantly can be expressed in
 a reduce operation with a simple one line lamba function.

>>> I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
>>> function.  This probably won't be as fast though because you end up
>>> creating objects where as the version I gave will get codgened to operate
>>> on binary data the whole way though.
>>>
 The same applies to these Window functions. I had to read it 3 times to
 understand what it all means. Maybe it makes sense for someone who has been
 forced to use such limited tools in sql for many years but that's not
 necessary what we should aim for. Why can I not just have the sortBy and
 then an Iterator[X] => Iterator[Y] to express what I want to do?

>>> We also have orderBy and mapPartitions.
>>>
 All these functions (rank etc.) can be trivially expressed in this,
 plus I can add other operations if needed, instead of being locked in like
 this Window framework.

>>>  I agree that window functions would probably not be my first choice for
>>> many problems, but for people coming from SQL it was a very popular
>>> feature.  My real goal is to give as many paradigms as possible in a single
>>> unified framework.  Let people pick the right mode of expression for any
>>> given job :)
>>>
>>
>>
>


Re: example LDA code ClassCastException

2016-11-03 Thread Asher Krim
There is an open Jira for this issue (
https://issues.apache.org/jira/browse/SPARK-14804). There have been a few
proposed fixes so far.

On Thu, Nov 3, 2016 at 2:20 PM, jamborta  wrote:

> Hi there,
>
> I am trying to run the example LDA code
> (http://spark.apache.org/docs/latest/mllib-clustering.html#
> latent-dirichlet-allocation-lda)
> on Spark 2.0.0/EMR 5.0.0
>
> If run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/")
>
> ldaModel = LDA.train(corpus, k=3, maxIterations=200, checkpointInterval=10)
>
> I get the following error (sorry, quite long):
>
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 ldaModel = LDA.train(corpus, k=3, maxIterations=200,
> checkpointInterval=10)
>
> /usr/lib/spark/python/pyspark/mllib/clustering.py in train(cls, rdd, k,
> maxIterations, docConcentration, topicConcentration, seed,
> checkpointInterval, optimizer)
>1037 model = callMLlibFunc("trainLDAModel", rdd, k,
> maxIterations,
>1038   docConcentration, topicConcentration,
> seed,
> -> 1039   checkpointInterval, optimizer)
>1040 return LDAModel(model)
>1041
>
> /usr/lib/spark/python/pyspark/mllib/common.py in callMLlibFunc(name,
> *args)
> 128 sc = SparkContext.getOrCreate()
> 129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
> --> 130 return callJavaFunc(sc, api, *args)
> 131
> 132
>
> /usr/lib/spark/python/pyspark/mllib/common.py in callJavaFunc(sc, func,
> *args)
> 121 """ Call Java Function """
> 122 args = [_py2java(sc, a) for a in args]
> --> 123 return _java2py(sc, func(*args))
> 124
> 125
>
> /usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in
> __call__(self, *args)
> 931 answer = self.gateway_client.send_command(command)
> 932 return_value = get_return_value(
> --> 933 answer, self.gateway_client, self.target_id, self.name
> )
> 934
> 935 for temp_arg in temp_args:
>
> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
>
> /usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in
> get_return_value(answer, gateway_client, target_id, name)
> 310 raise Py4JJavaError(
> 311 "An error occurred while calling {0}{1}{2}.\n".
> --> 312 format(target_id, ".", name), value)
> 313 else:
> 314 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o115.trainLDAModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1
> in stage 458.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 458.0 (TID 14827, ip-10-197-192-2.eu-west-1.compute.internal):
> java.lang.ClassCastException: scala.Tuple2 cannot be cast to
> org.apache.spark.graphx.Edge
> at
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$
> apply$1.apply(EdgeRDD.scala:107)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at
> org.apache.spark.InterruptibleIterator.foreach(
> InterruptibleIterator.scala:28)
> at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(
> EdgeRDD.scala:107)
> at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(
> EdgeRDD.scala:105)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$
> anonfun$apply$25.apply(RDD.scala:801)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$
> anonfun$apply$25.apply(RDD.scala:801)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
> at
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:919)
> at
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.doPut(
> BlockManager.scala:866)
> at
> org.apache.spark.storage.BlockManager.doPutIterator(
> BlockManager.scala:910)
> at
> org.apache.spark.storage.BlockManager.getOrElseUpdate(
> BlockManager.scala:668)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)

expected behavior of Kafka dynamic topic subscription

2016-11-03 Thread Haopu Wang
I'm using the Kafka 0.10 integration API to create a DStream using the
SubscribePattern ConsumerStrategy.

The specified topic doesn't exist when I start the application.

Then I create the topic and publish some test messages. I can see them
in the console consumer.

But the Spark application doesn't seem to get the messages.

I think this is not the expected behavior, right? What should I check to resolve it?

Thank you very much!
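For reference, a minimal Scala sketch of the setup being described (spark-streaming-kafka-0-10; the broker list, group id and topic pattern are made-up placeholders):

import java.util.regex.Pattern

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(new SparkConf().setAppName("pattern-subscribe"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "pattern-subscriber",
  "auto.offset.reset"  -> "latest")

// Subscribe to every topic whose name matches the pattern; topics created later
// should be picked up when the consumer refreshes its metadata.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams))

stream.map(_.value).print()
ssc.start()
ssc.awaitTermination()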



Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then
do mapPartitions?

sorry thinking out loud a bit here. ok i think that could work. thanks

On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers  wrote:

> thats an interesting thought about orderBy and mapPartitions. i guess i
> could emulate a groupBy with secondary sort using those two. however isn't
> using an orderBy expensive since it is a total sort? i mean a groupBy with
> secondary sort is also a total sort under the hood, but its on
> (hashCode(key), secondarySortColumn) which is easier to distribute and
> therefore can be implemented more efficiently.
>
>
>
>
>
> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust 
> wrote:
>
>> It is still unclear to me why we should remember all these tricks (or add
>>> lots of extra little functions) when this elegantly can be expressed in a
>>> reduce operation with a simple one line lamba function.
>>>
>> I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
>> function.  This probably won't be as fast though because you end up
>> creating objects where as the version I gave will get codgened to operate
>> on binary data the whole way though.
>>
>>> The same applies to these Window functions. I had to read it 3 times to
>>> understand what it all means. Maybe it makes sense for someone who has been
>>> forced to use such limited tools in sql for many years but that's not
>>> necessary what we should aim for. Why can I not just have the sortBy and
>>> then an Iterator[X] => Iterator[Y] to express what I want to do?
>>>
>> We also have orderBy and mapPartitions.
>>
>>> All these functions (rank etc.) can be trivially expressed in this, plus
>>> I can add other operations if needed, instead of being locked in like this
>>> Window framework.
>>>
>>  I agree that window functions would probably not be my first choice for
>> many problems, but for people coming from SQL it was a very popular
>> feature.  My real goal is to give as many paradigms as possible in a single
>> unified framework.  Let people pick the right mode of expression for any
>> given job :)
>>
>
>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
thats an interesting thought about orderBy and mapPartitions. i guess i
could emulate a groupBy with secondary sort using those two. however isn't
using an orderBy expensive since it is a total sort? i mean a groupBy with
secondary sort is also a total sort under the hood, but its on
(hashCode(key), secondarySortColumn) which is easier to distribute and
therefore can be implemented more efficiently.





On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust 
wrote:

> It is still unclear to me why we should remember all these tricks (or add
>> lots of extra little functions) when this elegantly can be expressed in a
>> reduce operation with a simple one line lamba function.
>>
> I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
> function.  This probably won't be as fast though because you end up
> creating objects where as the version I gave will get codgened to operate
> on binary data the whole way though.
>
>> The same applies to these Window functions. I had to read it 3 times to
>> understand what it all means. Maybe it makes sense for someone who has been
>> forced to use such limited tools in sql for many years but that's not
>> necessary what we should aim for. Why can I not just have the sortBy and
>> then an Iterator[X] => Iterator[Y] to express what I want to do?
>>
> We also have orderBy and mapPartitions.
>
>> All these functions (rank etc.) can be trivially expressed in this, plus
>> I can add other operations if needed, instead of being locked in like this
>> Window framework.
>>
>  I agree that window functions would probably not be my first choice for
> many problems, but for people coming from SQL it was a very popular
> feature.  My real goal is to give as many paradigms as possible in a single
> unified framework.  Let people pick the right mode of expression for any
> given job :)
>


unsubscribe

2016-11-03 Thread शशिकांत कुलकर्णी
Regards,
Shashikant


Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth  wrote:
> What is the purpose of the delegation token renewal (the one that is done
> automatically by Hadoop libraries, after 1 day by default)? It seems that it
> always happens (every day) until the token expires, no matter what. I'd
> probably find an answer to that in a basic Hadoop security description.

I'm not sure and I never really got a good answer to that (I had the
same question in the past). My best guess is to limit how long an
attacker can do bad things if he gets hold of a delegation token. But
IMO if an attacker gets a delegation token, that's pretty bad
regardless of how long he can use it...

> I have a feeling that giving the keytab to Spark bypasses the concept behind
> delegation tokens. As I understand, the NN basically says that "your
> application can access hdfs with this delegation token, but only for 7
> days".

I'm not sure why there's a 7 day limit either, but let's assume
there's a good reason. Basically the app, at that point, needs to
prove to the NN it has a valid kerberos credential. Whether that's
from someone typing their password into a terminal, or code using a
keytab, it doesn't really matter. If someone was worried about that
user being malicious they'd disable the user's login in the KDC.

This feature is needed because there are apps that need to keep
running, unattended, for longer than HDFS's max lifetime setting.

-- 
Marcelo




Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
>
> It is still unclear to me why we should remember all these tricks (or add
> lots of extra little functions) when this elegantly can be expressed in a
> reduce operation with a simple one line lamba function.
>
I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
function.  This probably won't be as fast, though, because you end up
creating objects, whereas the version I gave will get code-generated to
operate on binary data the whole way through.
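A minimal sketch of the reduceGroups route mentioned above (Spark 2.x Dataset API; the Txn case class and field names are made up for illustration):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Txn(mobileno: String, transactionDate: Long, customername: String)

def latestByReduce(spark: SparkSession, txns: Dataset[Txn]): Dataset[(String, Txn)] = {
  import spark.implicits._
  txns
    .groupByKey(_.mobileno)
    // keep whichever row of each pair has the later timestamp, i.e. an argmax per key
    .reduceGroups((a, b) => if (a.transactionDate >= b.transactionDate) a else b)
}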

> The same applies to these Window functions. I had to read it 3 times to
> understand what it all means. Maybe it makes sense for someone who has been
> forced to use such limited tools in sql for many years but that's not
> necessary what we should aim for. Why can I not just have the sortBy and
> then an Iterator[X] => Iterator[Y] to express what I want to do?
>
We also have orderBy and mapPartitions.

> All these functions (rank etc.) can be trivially expressed in this, plus I
> can add other operations if needed, instead of being locked in like this
> Window framework.
>
 I agree that window functions would probably not be my first choice for
many problems, but for people coming from SQL it was a very popular
feature.  My real goal is to give as many paradigms as possible in a single
unified framework.  Let people pick the right mode of expression for any
given job :)


Error creating SparkSession, in IntelliJ

2016-11-03 Thread shyla deshpande
Hello Everyone,

I just installed Spark 2.0.1, spark shell works fine.

Was able to run some simple programs from the Spark Shell, but find it hard
to make the same program work when using IntelliJ.
 I am getting the following error.

Exception in thread "main" java.lang.NoSuchMethodError:
scala.Predef$.$scope()Lscala/xml/TopScope$;
at org.apache.spark.ui.jobs.StagePage.<init>(StagePage.scala:44)
at org.apache.spark.ui.jobs.StagesTab.<init>(StagesTab.scala:34)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:62)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:440)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at SparkSessionTest.DataSetWordCount$.main(DataSetWordCount.scala:15)
at SparkSessionTest.DataSetWordCount.main(DataSetWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Thanks
-Shyla


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Oh okay that makes sense. The trick is to take max on tuple2 so you carry
the other column along.

It is still unclear to me why we should remember all these tricks (or add
lots of extra little functions) when this can elegantly be expressed in a
reduce operation with a simple one-line lambda function.

The same applies to these Window functions. I had to read it 3 times to
understand what it all means. Maybe it makes sense for someone who has been
forced to use such limited tools in SQL for many years, but that's not
necessarily what we should aim for. Why can I not just have the sortBy and
then an Iterator[X] => Iterator[Y] to express what I want to do? All these
functions (rank etc.) can be trivially expressed in this, plus I can add
other operations if needed, instead of being locked into this Window
framework.

On Nov 3, 2016 4:10 PM, "Michael Armbrust"  wrote:

You are looking to perform an *argmax*, which you can do with a single
aggregation.  Here is an example.

On Thu, Nov 3, 2016 at 4:53 AM, Rabin Banerjee  wrote:

> Hi All ,
>
>   I want to do a dataframe operation to find the rows having the latest
> timestamp in each group using the below operation
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
>
>
> Spark Version :: 1.6.x
>
> My Question is "Will Spark guarantee the order while doing the groupBy, if
> the DF is ordered using orderBy previously, in Spark 1.6.x"?
>
> I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>
> which claims it will work except in Spark 1.5.1 and 1.5.2.
>
> I need a bit of elaboration on how Spark handles it internally. Also, is it
> more efficient than using a Window function?
>
> Thanks in Advance,
>
> Rabin Banerjee
>
>
>
>


Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is private. You will need to put your code in that same package or
create an accessor to it living within that package

private[spark]
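A minimal sketch of the accessor approach (the object name is made up, and the file has to be compiled into your own jar rather than pasted into spark-shell, since it must live inside the org.apache.spark.ml.linalg package):

package org.apache.spark.ml.linalg

// Lives inside the package, so it can see the package-private BLAS object.
object BLASForwarder {
  def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)
}

Application code outside the package can then call org.apache.spark.ml.linalg.BLASForwarder.dot(a, b).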


2016-11-03 16:04 GMT-07:00 Yanwei Zhang :

> I would like to use some matrix operations in the BLAS object defined in
> ml.linalg. But for some reason, spark shell complains it cannot locate this
> object. I have constructed an example below to illustrate the issue. Please
> advise how to fix this. Thanks .
>
>
>
> import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
>
>
> val a = Vectors.dense(1.0, 2.0)
> val b = Vectors.dense(2.0, 3.0)
> BLAS.dot(a, b)
>
>
> :42: error: not found: value BLAS
>
>
>


Use BLAS object for matrix operation

2016-11-03 Thread Yanwei Zhang
I would like to use some matrix operations in the BLAS object defined in 
ml.linalg. But for some reason, spark shell complains it cannot locate this 
object. I have constructed an example below to illustrate the issue. Please 
advise how to fix this. Thanks .



import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}


val a = Vectors.dense(1.0, 2.0)
val b = Vectors.dense(2.0, 3.0)
BLAS.dot(a, b)


<console>:42: error: not found: value BLAS



Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Thank you for the clarification Marcelo, makes sense.
I'm thinking about 2 questions here, somewhat unrelated to the original
problem.

What is the purpose of the delegation token renewal (the one that is done
automatically by Hadoop libraries, after 1 day by default)? It seems that
it always happens (every day) until the token expires, no matter what. I'd
probably find an answer to that in a basic Hadoop security description.

I have a feeling that giving the keytab to Spark bypasses the concept
behind delegation tokens. As I understand, the NN basically says that "your
application can access hdfs with this delegation token, but only for 7
days". After 7 days, the NN should *ideally* ask me like "this app runs for
a week now, do you want to continue that?" - then I'd need to login with my
keytab and give the new delegation token to the application. I know that
this would be really difficult to handle, but now Spark just "ignores" the
whole token expiration mechanism and relogins every time it is needed. Am I
missing something?



2016-11-03 22:42 GMT+01:00 Marcelo Vanzin :

> I think you're a little confused about what "renewal" means here, and
> this might be the fault of the documentation (I haven't read it in a
> while).
>
> The existing delegation tokens will always be "renewed", in the sense
> that Spark (actually Hadoop code invisible to Spark) will talk to the
> NN to extend its lifetime. The feature you're talking about is for
> creating *new* delegation tokens after the old ones expire and cannot
> be renewed anymore (i.e. the max-lifetime configuration).
>
> On Thu, Nov 3, 2016 at 2:02 PM, Zsolt Tóth 
> wrote:
> > Yes, I did change dfs.namenode.delegation.key.update-interval and
> > dfs.namenode.delegation.token.renew-interval to 15 min, the
> max-lifetime to
> > 30min. In this case the application (without Spark having the keytab) did
> > not fail after 15 min, only after 30 min. Is it possible that the
> resource
> > manager somehow automatically renews the delegation tokens for my
> > application?
> >
> > 2016-11-03 21:34 GMT+01:00 Marcelo Vanzin :
> >>
> >> Sounds like your test was set up incorrectly. The default TTL for
> >> tokens is 7 days. Did you change that in the HDFS config?
> >>
> >> The issue definitely exists and people definitely have run into it. So
> >> if you're not hitting it, it's most definitely an issue with your test
> >> configuration.
> >>
> >> On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth 
> >> wrote:
> >> > Hi,
> >> >
> >> > I ran some tests regarding Spark's Delegation Token renewal mechanism.
> >> > As I
> >> > see, the concept here is simple: if I give my keytab file and client
> >> > principal to Spark, it starts a token renewal thread, and renews the
> >> > namenode delegation tokens after some time. This works fine.
> >> >
> >> > Then I tried to run a long application (with HDFS operation in the
> end)
> >> > without providing the keytab/principal to Spark, and I expected it to
> >> > fail
> >> > after the token expires. It turned out that this is not the case, the
> >> > application finishes successfully without a delegation token renewal
> by
> >> > Spark.
> >> >
> >> > My question is: how is that possible? Shouldn't a saveAsTextfile()
> fail
> >> > after the namenode delegation token expired?
> >> >
> >> > Regards,
> >> > Zsolt
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: RuntimeException: Null value appeared in non-nullable field when holding Optional Case Class

2016-11-03 Thread Aniket Bhatnagar
Issue raised - SPARK-18251

On Wed, Nov 2, 2016, 9:12 PM Aniket Bhatnagar 
wrote:

> Hi all
>
> I am running into a runtime exception when a DataSet is holding an Empty
> object instance for an Option type that is holding non-nullable field. For
> instance, if we have the following case class:
>
> case class DataRow(id: Int, value: String)
>
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and
> cannot hold Empty. If it does so, the following exception is thrown:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent
> failure: Lost task 6.0 in stage 0.0 (TID 6, localhost):
> java.lang.RuntimeException: Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class:
> "com.aol.advertising.dmp.audscale.uts.DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean,
> please try to use scala.Option[_] or other nullable types (e.g.
> java.lang.Integer instead of int/scala.Int).
>
>
> I am attaching a sample program to reproduce this. Is this a known
> limitation or a bug?
>
> Thanks,
> Aniket
>
> Full stack trace:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent
> failure: Lost task 6.0 in stage 0.0 (TID 6, localhost):
> java.lang.RuntimeException: Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean,
> please try to use scala.Option[_] or other nullable types (e.g.
> java.lang.Integer instead of int/scala.Int).
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
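For reference, a minimal, self-contained sketch of the scenario described above (Spark 2.0; DataRow is the case class from the message, the rest of the wiring is made up); running an action on such a Dataset is the kind of program that hits the quoted error:

import org.apache.spark.sql.SparkSession

case class DataRow(id: Int, value: String)

object OptionDatasetRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("option-ds").getOrCreate()
    import spark.implicits._

    // Holding only Some(DataRow(...)) works; adding a None is what triggers the
    // "Null value appeared in non-nullable field" error quoted above.
    val ds = Seq[Option[DataRow]](Some(DataRow(1, "a")), None).toDS()
    println(ds.count())   // the action forces evaluation and surfaces the error
    spark.stop()
  }
}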


Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
I think you're a little confused about what "renewal" means here, and
this might be the fault of the documentation (I haven't read it in a
while).

The existing delegation tokens will always be "renewed", in the sense
that Spark (actually Hadoop code invisible to Spark) will talk to the
NN to extend its lifetime. The feature you're talking about is for
creating *new* delegation tokens after the old ones expire and cannot
be renewed anymore (i.e. the max-lifetime configuration).

On Thu, Nov 3, 2016 at 2:02 PM, Zsolt Tóth  wrote:
> Yes, I did change dfs.namenode.delegation.key.update-interval and
> dfs.namenode.delegation.token.renew-interval to 15 min, the max-lifetime to
> 30min. In this case the application (without Spark having the keytab) did
> not fail after 15 min, only after 30 min. Is it possible that the resource
> manager somehow automatically renews the delegation tokens for my
> application?
>
> 2016-11-03 21:34 GMT+01:00 Marcelo Vanzin :
>>
>> Sounds like your test was set up incorrectly. The default TTL for
>> tokens is 7 days. Did you change that in the HDFS config?
>>
>> The issue definitely exists and people definitely have run into it. So
>> if you're not hitting it, it's most definitely an issue with your test
>> configuration.
>>
>> On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth 
>> wrote:
>> > Hi,
>> >
>> > I ran some tests regarding Spark's Delegation Token renewal mechanism.
>> > As I
>> > see, the concept here is simple: if I give my keytab file and client
>> > principal to Spark, it starts a token renewal thread, and renews the
>> > namenode delegation tokens after some time. This works fine.
>> >
>> > Then I tried to run a long application (with HDFS operation in the end)
>> > without providing the keytab/principal to Spark, and I expected it to
>> > fail
>> > after the token expires. It turned out that this is not the case, the
>> > application finishes successfully without a delegation token renewal by
>> > Spark.
>> >
>> > My question is: how is that possible? Shouldn't a saveAsTextfile() fail
>> > after the namenode delegation token expired?
>> >
>> > Regards,
>> > Zsolt
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Spark XML ignore namespaces

2016-11-03 Thread Arun Patel
I see that 'ignoring namespaces' issue is resolved.

https://github.com/databricks/spark-xml/pull/75

How do we enable this option and ignore namespace prefixes?

- Arun


Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Yes, I did change dfs.namenode.delegation.key.update-interval
and dfs.namenode.delegation.token.renew-interval to 15 min, the
max-lifetime to 30min. In this case the application (without Spark having
the keytab) did not fail after 15 min, only after 30 min. Is it possible
that the resource manager somehow automatically renews the delegation
tokens for my application?

2016-11-03 21:34 GMT+01:00 Marcelo Vanzin :

> Sounds like your test was set up incorrectly. The default TTL for
> tokens is 7 days. Did you change that in the HDFS config?
>
> The issue definitely exists and people definitely have run into it. So
> if you're not hitting it, it's most definitely an issue with your test
> configuration.
>
> On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth 
> wrote:
> > Hi,
> >
> > I ran some tests regarding Spark's Delegation Token renewal mechanism.
> As I
> > see, the concept here is simple: if I give my keytab file and client
> > principal to Spark, it starts a token renewal thread, and renews the
> > namenode delegation tokens after some time. This works fine.
> >
> > Then I tried to run a long application (with HDFS operation in the end)
> > without providing the keytab/principal to Spark, and I expected it to
> fail
> > after the token expires. It turned out that this is not the case, the
> > application finishes successfully without a delegation token renewal by
> > Spark.
> >
> > My question is: how is that possible? Shouldn't a saveAsTextfile() fail
> > after the namenode delegation token expired?
> >
> > Regards,
> > Zsolt
>
>
>
> --
> Marcelo
>


Slow Parquet write to HDFS using Spark

2016-11-03 Thread morfious902002
I am using Spark 1.6.1 and writing to HDFS. In some cases it seems like all
the work is being done by one thread. Why is that?

Also, I need parquet.enable.summary-metadata to register the parquet files
to Impala.

Df.write().partitionBy("COLUMN").parquet(outputFileLocation);

It also seems like all of this happens on one CPU of an executor.

16/11/03 14:59:20 INFO datasources.DynamicPartitionWriterContainer: 
Using
user defined output committer class
org.apache.parquet.hadoop.ParquetOutputCommitter
16/11/03 14:59:20 INFO mapred.SparkHadoopMapRedUtil: No need to commit
output of task because needsTaskCommit=false:
attempt_201611031459_0154_m_29_0
16/11/03 15:17:56 INFO sort.UnsafeExternalSorter: Thread 545 spilling 
sort
data of 41.9 GB to disk (3  times so far)
16/11/03 15:21:05 INFO storage.ShuffleBlockFetcherIterator: Getting 0
non-empty blocks out of 0 blocks
16/11/03 15:21:05 INFO storage.ShuffleBlockFetcherIterator: Started 0
remote fetches in 1 ms
16/11/03 15:21:05 INFO datasources.DynamicPartitionWriterContainer: 
Using
user defined output committer class
org.apache.parquet.hadoop.ParquetOutputCommitter
16/11/03 15:21:05 INFO codec.CodecConfig: Compression: GZIP
16/11/03 15:21:05 INFO hadoop.ParquetOutputFormat: Parquet block size to
134217728
16/11/03 15:21:05 INFO hadoop.ParquetOutputFormat: Parquet page size to
1048576
16/11/03 15:21:05 INFO hadoop.ParquetOutputFormat: Parquet dictionary 
page
size to 1048576
16/11/03 15:21:05 INFO hadoop.ParquetOutputFormat: Dictionary is on
16/11/03 15:21:05 INFO hadoop.ParquetOutputFormat: Validation is off
16/11/03 15:21:05 INFO hadoop.ParquetOutputFormat: Writer version is:
PARQUET_1_0
16/11/03 15:21:05 INFO parquet.CatalystWriteSupport: Initialized Parquet
WriteSupport with Catalyst schema:

Then again :-

16/11/03 15:21:05 INFO compress.CodecPool: Got brand-new compressor 
[.gz]
16/11/03 15:21:05 INFO datasources.DynamicPartitionWriterContainer: 
Maximum
partitions reached, falling back on sorting.
16/11/03 15:32:37 INFO sort.UnsafeExternalSorter: Thread 545 spilling 
sort
data of 31.8 GB to disk (0  time so far)
16/11/03 15:45:47 INFO sort.UnsafeExternalSorter: Thread 545 spilling 
sort
data of 31.8 GB to disk (1  time so far)
16/11/03 15:48:44 INFO datasources.DynamicPartitionWriterContainer: 
Sorting
complete. Writing out partition files one at a time.
16/11/03 15:48:44 INFO codec.CodecConfig: Compression: GZIP
16/11/03 15:48:44 INFO hadoop.ParquetOutputFormat: Parquet block size to
134217728
16/11/03 15:48:44 INFO hadoop.ParquetOutputFormat: Parquet page size to
1048576
16/11/03 15:48:44 INFO hadoop.ParquetOutputFormat: Parquet dictionary 
page
size to 1048576
16/11/03 15:48:44 INFO hadoop.ParquetOutputFormat: Dictionary is on
16/11/03 15:48:44 INFO hadoop.ParquetOutputFormat: Validation is off
16/11/03 15:48:44 INFO hadoop.ParquetOutputFormat: Writer version is:
PARQUET_1_0
16/11/03 15:48:44 INFO parquet.CatalystWriteSupport: Initialized Parquet
WriteSupport with Catalyst schema:

The Schema 

About 200 of the following lines again and again 20 times or so. 
 
16/11/03 15:48:44 INFO compress.CodecPool: Got brand-new compressor 
[.gz]
16/11/03 15:49:50 INFO hadoop.InternalParquetRecordWriter: mem size
135,903,551 > 134,217,728: flushing 1,040,100 records to disk.
16/11/03 15:49:50 INFO hadoop.InternalParquetRecordWriter: Flushing mem
columnStore to file. allocated memory: 89,688,651

About 200 of the following lines

16/11/03 15:49:51 INFO hadoop.ColumnChunkPageWriteStore: written 
413,231B
for [a17bbfb1_2808_11e6_a4e6_77b5e8f92a4f] BINARY: 1,040,100 values,
1,138,534B raw, 412,919B comp, 8 pages, encodings: [RLE, BIT_PACKED,
PLAIN_DICTIONARY], dic { 356 entries, 2,848B raw, 356B comp}

Then at last:- 

16/11/03 16:15:41 INFO output.FileOutputCommitter: Saved output of task
'attempt_201611031521_0154_m_40_0' to
hdfs://PATH/_temporary/0/task_201611031521_0154_m_40
16/11/03 16:15:41 INFO mapred.SparkHadoopMapRedUtil:
attempt_201611031521_0154_m_40_0: Committed
16/11/03 16:15:41 INFO executor.Executor: Finished task 40.0 in stage 
154.0
(TID 8545). 3757 bytes result sent to driver
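For reference, a minimal sketch of the write described in this message (Spark 1.6 API; the paths are placeholders, and parquet.enable.summary-metadata is the flag mentioned above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
val sqlContext = new SQLContext(sc)

// The flag mentioned above, so the summary files needed to register with Impala are written.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")

val df = sqlContext.read.parquet("/path/to/input")          // placeholder input
df.write.partitionBy("COLUMN").parquet("/path/to/output")   // placeholder output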



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Slow-Parquet-write-to-HDFS-using-Spark-tp28011.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-03 Thread Mohit Jaggi
For linear regression, it should be fairly easy. Just sort the coefficients :)

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com
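As a sketch of that suggestion (MLlib API; featureNames is assumed to be supplied by the caller, and comparing raw weights is only meaningful if the features were standardized to a common scale):

import org.apache.spark.mllib.regression.LinearRegressionModel

def rankFeatures(model: LinearRegressionModel,
                 featureNames: Array[String]): Seq[(String, Double)] =
  featureNames
    .zip(model.weights.toArray)
    .sortBy { case (_, w) => -math.abs(w) }   // largest |weight| first
    .toSeq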




> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca  wrote:
> 
> Hi All,
> 
> I am using SPARK and in particular the MLib library.
> 
> import org.apache.spark.mllib.regression.LabeledPoint;
> import org.apache.spark.mllib.regression.LinearRegressionModel;
> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
> 
> For my problem I am using the LinearRegressionWithSGD and I would like to 
> perform a “Rank Features By Importance”.
> 
> I checked the documentation and it seems that does not provide such methods.
> 
> Am I missing anything?  Please, could you provide any help on this?
> Should I change the approach?
> 
> Many Thanks in advance,
> 
> Best Regards,
> Carlo
> 
> 
> -- The Open University is incorporated by Royal Charter (RC 000391), an 
> exempt charity in England & Wales and a charity registered in Scotland (SC 
> 038302). The Open University is authorised and regulated by the Financial 
> Conduct Authority.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 





Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
Sounds like your test was set up incorrectly. The default TTL for
tokens is 7 days. Did you change that in the HDFS config?

The issue definitely exists and people definitely have run into it. So
if you're not hitting it, it's most definitely an issue with your test
configuration.

On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth  wrote:
> Hi,
>
> I ran some tests regarding Spark's Delegation Token renewal mechanism. As I
> see, the concept here is simple: if I give my keytab file and client
> principal to Spark, it starts a token renewal thread, and renews the
> namenode delegation tokens after some time. This works fine.
>
> Then I tried to run a long application (with HDFS operation in the end)
> without providing the keytab/principal to Spark, and I expected it to fail
> after the token expires. It turned out that this is not the case, the
> application finishes successfully without a delegation token renewal by
> Spark.
>
> My question is: how is that possible? Shouldn't a saveAsTextfile() fail
> after the namenode delegation token expired?
>
> Regards,
> Zsolt



-- 
Marcelo




Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Any ideas about this one? Am I missing something here?

2016-11-03 15:22 GMT+01:00 Zsolt Tóth :

> Hi,
>
> I ran some tests regarding Spark's Delegation Token renewal mechanism. As
> I see, the concept here is simple: if I give my keytab file and client
> principal to Spark, it starts a token renewal thread, and renews the
> namenode delegation tokens after some time. This works fine.
>
> Then I tried to run a long application (with HDFS operation in the end)
> without providing the keytab/principal to Spark, and I expected it to fail
> after the token expires. It turned out that this is not the case, the
> application finishes successfully without a delegation token renewal by
> Spark.
>
> My question is: how is that possible? Shouldn't a saveAsTextfile() fail
> after the namenode delegation token expired?
>
> Regards,
> Zsolt
>


Re: Aggregation Calculation

2016-11-03 Thread Andrés Ivaldi
I'm not sure about inline views, it will still performing aggregation that
I don't need. I think I didn't explain right, I've already filtered the
values that I need, the problem is that default calculation of rollUp give
me some calculations that I don't want like only aggregation by the second
column.
Suppose tree columns (DataSet Columns) Year, Moth, Import, and I want
aggregation sum(Import), and the combination of all Year/Month Sum(import),
also Year Sum(import), but Mont Sum(import) doesn't care

in table it will looks like

YEAR | MOTH | Sum(Import)
2006 | 1| 
2005 | 1| 
2005 | 2| 
2006 | null | 
2005 | null | 
null | null | 
null | 1| 
null | 2| 

the las tree rows are not needed, in this example I could perform filtering
after rollUp i do the query by demand  so it will grow depending on number
of rows and columns, and will be a lot of combinations that I don't need.

thanks
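One way to express the "filter after rollUp" idea from the example above, assuming Spark 2.x where grouping_id() is available; df and the column names come from the example:

import org.apache.spark.sql.functions._

val rolled = df                                  // df: the DataFrame from the example
  .cube(col("YEAR"), col("MONTH"))
  .agg(sum(col("Import")).as("sum_import"), grouping_id().as("gid"))

// gid is a bit mask: the first cube column is the highest bit, and a bit is 1 when
// that column was aggregated away. 0 -> (YEAR, MONTH), 1 -> YEAR only,
// 2 -> MONTH only, 3 -> grand total. Keep only the first two.
val wanted = rolled.filter(col("gid") <= 1).drop("gid")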





On Thu, Nov 3, 2016 at 4:04 PM, Stephen Boesch  wrote:

> You would likely want to create inline views that perform the filtering 
> *before
> *performing t he cubes/rollup; in this way the cubes/rollups only operate
> on the pruned rows/columns.
>
> 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi :
>
>> Hello, I need to perform some aggregations and a kind of Cube/RollUp
>> calculation
>>
>> Doing some test looks like Cube and RollUp performs aggregation over all
>> posible columns combination, but I just need some specific columns
>> combination.
>>
>> What I'm trying to do is like a dataTable where te first N columns are
>> may rows and the second M values are my columns and the last columna are
>> the aggregated values, like Dimension / Measures
>>
>> I need all the values of the N and M columns and the ones that correspond
>> to the aggregation function. I'll never need the values that previous
>> column has no value, ie
>>
>> having N=2 so two columns as rows I'll need
>> R1 | R2  
>> ##  |  ## 
>> ##  |   null 
>>
>> but not
>> null | ## 
>>
>> as roll up does, same approach to M columns
>>
>>
>> So the question is what could be the better way to perform this
>> calculation.
>> Using rollUp/Cube give me a lot of values that I dont need
>> Using groupBy give me less information ( I could do several groupBy but
>> that is not performant, I think )
>> Is any other way to something like that?
>>
>> Thanks.
>>
>>
>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Re: incomplete aggregation in a GROUP BY

2016-11-03 Thread Michael Armbrust
Sounds like a bug, if you can reproduce on 1.6.3 (currently being voted
on), then please open a JIRA.

On Thu, Nov 3, 2016 at 8:05 AM, Donald Matthews  wrote:

> While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a
> HiveContext GROUP BY that no longer works reliably.
> The GROUP BY results are not always fully aggregated; instead, I get lots
> of duplicate + triplicate sets of group values.
>
> I've come up with a workaround that works for me, but the behaviour in
> question seems like a Spark bug, and since I don't see anything matching
> this in the Spark Jira or on this list, I thought I should check with this
> list to see if it's a known issue or if it might be worth creating a ticket
> for.
>
> Input:  A single table with 24 fields that I want to group on, and a few
> other fields that I want to aggregate.
>
> Statement: similar to hiveContext.sql("""
> SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
> FROM inputTable
> GROUP BY a,b,c, ..., x
> """)
>
> Checking the data for one sample run, I see that the input table has about
> 1.1M rows, with 18157 unique combinations of those 24 grouped values.
>
> Expected output: A table of 18157 rows.
>
> Observed output: A table of 28006 rows. Looking just at unique
> combinations of those grouped fields, I see that while 10125 rows are
> unique as expected, there are 6215 duplicate rows and 1817 triplicate rows.
>
> This is not quite 100% repeatable. That is, I'll see the issue repeatedly
> one day, but the next day with the same input data the GROUP BY will work
> correctly.
>
> For now it seems that I have a workaround: if I presort the input table on
> those grouped fields, the GROUP BY works correctly. But of course I
> shouldn't have to do that.
>
> Does this sort of GROUP BY issue seem familiar to anyone?
>
> /drm
>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
You are looking to perform an *argmax*, which you can do with a single
aggregation.  Here is an example.
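For illustration, a minimal sketch of the single-aggregation argmax (the "max on a tuple/struct" trick mentioned elsewhere in the thread), using the column names from the question; df is the input DataFrame:

import org.apache.spark.sql.functions._

val latest = df
  .groupBy(col("mobileno"))
  // struct ordering compares field by field, so putting transaction_date first
  // makes max() pick the values belonging to the latest timestamp
  .agg(max(struct(col("transaction_date"), col("customername"),
                  col("service_type"), col("cust_addr"))).as("latest"))
  .select(col("mobileno"), col("latest.customername"), col("latest.service_type"),
          col("latest.cust_addr"))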

On Thu, Nov 3, 2016 at 4:53 AM, Rabin Banerjee  wrote:

> Hi All ,
>
>   I want to do a dataframe operation to find the rows having the latest
> timestamp in each group using the below operation
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
>
>
> Spark Version :: 1.6.x
>
> My Question is "Will Spark guarantee the order while doing the groupBy, if
> the DF is ordered using orderBy previously, in Spark 1.6.x"?
>
> I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>
> which claims it will work except in Spark 1.5.1 and 1.5.2.
>
> I need a bit of elaboration on how Spark handles it internally. Also, is it
> more efficient than using a Window function?
>
> Thanks in Advance,
>
> Rabin Banerjee
>
>
>
>


How do I specify StorageLevel in KafkaUtils.createDirectStream?

2016-11-03 Thread kant kodali
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream =
    KafkaUtils.createDirectStream(ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));


Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi Baki,
It's enough for the producer to write the messages compressed. See here:
https://cwiki.apache.org/confluence/display/KAFKA/Compression

Thank you.
Daniel
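For reference, a minimal sketch of producer-side compression (plain Kafka producer API; the broker, topic and the choice of gzip are made-up placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("compression.type", "gzip")   // messages are compressed on the wire and on disk

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("my-topic", "a message"))
producer.close()

The consumer, and therefore the Spark stream, receives the messages already decompressed, so no extra handling is needed on the Spark side.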

> On 3 Nov 2016, at 21:27, Daniel Haviv  wrote:
> 
> Hi,
> Kafka can compress/uncompress your messages for you seamlessly, adding 
> compression on top of that will be redundant.
> 
> Thank you.
> Daniel
> 
>> On 3 Nov 2016, at 20:53, bhayat  wrote:
>> 
>> Hello,
>> 
>> I really wonder that whether i can stream compressed data with using
>> KafkaUtils.createDirectStream(...) or not.
>> 
>> This is formal code that i use ;
>> 
>> JavaPairInputDStream messages =
>> KafkaUtils.createStream(javaStreamingContext, zookeeperConfiguration,
>> groupName, topicMap, StorageLevel.MEMORY_AND_DISK_SER());
>> 
>> final JavaDStream lines = messages.map(new
>> Function, String>() {
>>@Override
>>public String call(Tuple2 tuple2) {
>>System.out.println("Stream Received: " + tuple2._2());
>>return tuple2._2();
>>}
>>});
>> 
>> in this code i am consuming string message but in my case i need to consume
>> compressed data from stream then uncompress it and finally read string
>> message.
>> 
>> Could you please asist me how i can go in right way ? ı am a bit confused
>> that whether spark streaming has this ability or not.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Stream-compressed-data-from-KafkaUtils-createDirectStream-tp28010.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 


Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi,
Kafka can compress/uncompress your messages for you seamlessly, adding 
compression on top of that will be redundant.

Thank you.
Daniel

> On 3 Nov 2016, at 20:53, bhayat  wrote:
> 
> Hello,
> 
> I really wonder that whether i can stream compressed data with using
> KafkaUtils.createDirectStream(...) or not.
> 
> This is formal code that i use ;
> 
> JavaPairInputDStream messages =
> KafkaUtils.createStream(javaStreamingContext, zookeeperConfiguration,
> groupName, topicMap, StorageLevel.MEMORY_AND_DISK_SER());
> 
> final JavaDStream lines = messages.map(new
> Function, String>() {
>@Override
>public String call(Tuple2 tuple2) {
>System.out.println("Stream Received: " + tuple2._2());
>return tuple2._2();
>}
>});
> 
> in this code i am consuming string message but in my case i need to consume
> compressed data from stream then uncompress it and finally read string
> message.
> 
> Could you please asist me how i can go in right way ? ı am a bit confused
> that whether spark streaming has this ability or not.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Stream-compressed-data-from-KafkaUtils-createDirectStream-tp28010.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering
*before* performing the cubes/rollups; in this way the cubes/rollups only
operate on the pruned rows/columns.

2016-11-03 11:29 GMT-07:00 Andrés Ivaldi :

> Hello, I need to perform some aggregations and a kind of Cube/RollUp
> calculation
>
> Doing some test looks like Cube and RollUp performs aggregation over all
> posible columns combination, but I just need some specific columns
> combination.
>
> What I'm trying to do is like a dataTable where te first N columns are may
> rows and the second M values are my columns and the last columna are the
> aggregated values, like Dimension / Measures
>
> I need all the values of the N and M columns and the ones that correspond
> to the aggregation function. I'll never need the values that previous
> column has no value, ie
>
> having N=2 so two columns as rows I'll need
> R1 | R2  
> ##  |  ## 
> ##  |   null 
>
> but not
> null | ## 
>
> as roll up does, same approach to M columns
>
>
> So the question is what could be the better way to perform this
> calculation.
> Using rollUp/Cube give me a lot of values that I dont need
> Using groupBy give me less information ( I could do several groupBy but
> that is not performant, I think )
> Is any other way to something like that?
>
> Thanks.
>
>
>
>
>
> --
> Ing. Ivaldi Andres
>


Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread bhayat
Hello,

I really wonder whether I can stream compressed data using
KafkaUtils.createDirectStream(...) or not.

This is the code that I use:

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createStream(javaStreamingContext, zookeeperConfiguration,
        groupName, topicMap, StorageLevel.MEMORY_AND_DISK_SER());

final JavaDStream<String> lines = messages.map(new
    Function<Tuple2<String, String>, String>() {
        @Override
        public String call(Tuple2<String, String> tuple2) {
            System.out.println("Stream Received: " + tuple2._2());
            return tuple2._2();
        }
    });

In this code I am consuming string messages, but in my case I need to consume
compressed data from the stream, then uncompress it and finally read the
string message.

Could you please assist me with the right way to go? I am a bit confused
about whether Spark Streaming has this ability or not.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stream-compressed-data-from-KafkaUtils-createDirectStream-tp28010.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Fwd: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread baki hayat
Hello,

I really wonder whether I can stream compressed data using
KafkaUtils.createDirectStream(...) or not.

This is the code that I use:

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createStream(javaStreamingContext, zookeeperConfiguration,
        groupName, topicMap, StorageLevel.MEMORY_AND_DISK_SER());

In this code I am consuming string messages, but in my case I need to
consume compressed data from the stream and then uncompress it.

Could you please assist me with the right way to go?


-- 
bhayat


Aggregation Calculation

2016-11-03 Thread Andrés Ivaldi
Hello, I need to perform some aggregations and a kind of Cube/RollUp
calculation.

Doing some tests, it looks like Cube and RollUp perform the aggregation over
all possible column combinations, but I just need some specific column
combinations.

What I'm trying to do is like a dataTable where the first N columns are my
rows, the next M values are my columns, and the last columns are the
aggregated values, like Dimension / Measures.

I need all the values of the N and M columns and the ones that correspond
to the aggregation function. I'll never need the values where a previous
column has no value, i.e.

having N=2, so two columns as rows, I'll need
R1 | R2
## | ##
## | null

but not
null | ##

as rollUp does; the same approach applies to the M columns.

So the question is: what could be the better way to perform this calculation?
Using rollUp/Cube gives me a lot of values that I don't need.
Using groupBy gives me less information (I could do several groupBys, but
that is not performant, I think).
Is there any other way to do something like that?

Thanks.





-- 
Ing. Ivaldi Andres


example LDA code ClassCastException

2016-11-03 Thread jamborta
Hi there,

I am trying to run the example LDA code
(http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
on Spark 2.0.0/EMR 5.0.0

If run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/")

ldaModel = LDA.train(corpus, k=3, maxIterations=200, checkpointInterval=10)

I get the following error (sorry, quite long): 

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 ldaModel = LDA.train(corpus, k=3, maxIterations=200,
checkpointInterval=10)

/usr/lib/spark/python/pyspark/mllib/clustering.py in train(cls, rdd, k,
maxIterations, docConcentration, topicConcentration, seed,
checkpointInterval, optimizer)
   1037 model = callMLlibFunc("trainLDAModel", rdd, k,
maxIterations,
   1038   docConcentration, topicConcentration,
seed,
-> 1039   checkpointInterval, optimizer)
   1040 return LDAModel(model)
   1041 

/usr/lib/spark/python/pyspark/mllib/common.py in callMLlibFunc(name, *args)
128 sc = SparkContext.getOrCreate()
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131 
132 

/usr/lib/spark/python/pyspark/mllib/common.py in callJavaFunc(sc, func,
*args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124 
125 

/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in
__call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934 
935 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(

Py4JJavaError: An error occurred while calling o115.trainLDAModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 458.0 failed 4 times, most recent failure: Lost task 1.3 in stage
458.0 (TID 14827, ip-10-197-192-2.eu-west-1.compute.internal):
java.lang.ClassCastException: scala.Tuple2 cannot be cast to
org.apache.spark.graphx.Edge
at
org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPa

Re: mLIb solving linear regression with sparse inputs

2016-11-03 Thread Robineast
Any reason why you can’t use the built-in linear regression, e.g. 
http://spark.apache.org/docs/latest/ml-classification-regression.html#regression
 or 
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression?
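
For reference, a minimal Scala sketch of the built-in spark.ml route; the toy data below is made up, and sparse feature vectors are supported directly:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sparse-lr").getOrCreate()

// labels b and sparse feature rows of A, instead of forming inverse(A' * A) * A' * b
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.sparse(5, Array(0, 3), Array(1.0, 2.0))),
  (0.5, Vectors.sparse(5, Array(1, 4), Array(3.0, 1.0))),
  (2.0, Vectors.sparse(5, Array(0, 2), Array(2.0, 4.0)))
)).toDF("label", "features")

val model = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.0)      // plain least squares; add regularization if desired
  .setSolver("l-bfgs")   // iterative solver, avoids forming the normal equations
  .fit(training)

println(model.coefficients)  // the solution vector x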

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 3 Nov 2016, at 16:08, im281 [via Apache Spark User List] 
>  wrote:
> 
> I want to solve the linear regression problem using spark with huge 
> martrices: 
> 
> Ax = b 
> using least squares: 
> x = Inverse(A-transpose * A) * A-transpose * b 
> 
> The A matrix is a large sparse matrix (as is the b vector). 
> 
> I have pondered several solutions to the Ax = b problem including: 
> 
> 1) directly solving the problem above where the matrix is transposed, 
> multiplied by itself, the inverse is taken and then multiplied by A-transpose 
> and then multiplied by b which will give the solution vector x 
> 
> 2) iterative solver (no need to take the inverse) 
> 
> My question is:
> 
> What is the best way to solve this problem using the MLib libraries, in JAVA 
> and using RDD and spark? 
> 
> Is there any code as an example? Has anyone done this? 
> 
> 
> 
> 
> 
> The code to take in data represented as a coordinate matrix and perform 
> transposition and multiplication is shown below but I need to take the 
> inverse if I use this strategy: 
> 
> //Read coordinate matrix from text or database 
> JavaRDD<String> fileA = sc.textFile(file); 
> 
> //map text file with coordinate data (sparse matrix) to JavaRDD<MatrixEntry>
> JavaRDD<MatrixEntry> matrixA = fileA.map(new Function<String, MatrixEntry>() { 
> public MatrixEntry call(String x){ 
> String[] indeceValue = x.split(","); 
> long i = Long.parseLong(indeceValue[0]); 
> long j = Long.parseLong(indeceValue[1]); 
> double value = Double.parseDouble(indeceValue[2]); 
> return new MatrixEntry(i, j, value ); 
> } 
> }); 
> 
> //coordinate matrix from sparse data 
> CoordinateMatrix cooMatrixA = new 
> CoordinateMatrix(matrixA.rdd()); 
> 
> //create block matrix 
> BlockMatrix matA = cooMatrixA.toBlockMatrix(); 
> 
> //create block matrix after matrix multiplication (square 
> matrix) 
> BlockMatrix ata = matA.transpose().multiply(matA); 
> 
> //print out the original dense matrix 
> System.out.println(matA.toLocalMatrix().toString()); 
> 
> //print out the transpose of the dense matrix 
> 
> System.out.println(matA.transpose().toLocalMatrix().toString()); 
> 
> //print out the square matrix (after multiplication) 
> System.out.println(ata.toLocalMatrix().toString()); 
> 
> JavaRDD<MatrixEntry> entries = 
> ata.toCoordinateMatrix().entries().toJavaRDD(); 
> 
> 
> 
> 




-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mLIb-solving-linear-regression-with-sparse-inputs-tp28006p28007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Silvio Fiorito
Hi Nishit,


Yes the JDBC connector supports predicate pushdown and column pruning. So any 
selection you make on the dataframe will get materialized in the query sent via 
JDBC.


You should be able to verify this by looking at the physical query plan:


val df = sqlContext.read.jdbc(url, table, connectionProperties).select($"col1", $"col2")

df.explain(true)


Or if you can easily log queries submitted to your database then you can view 
the specific query.


Thanks,

Silvio


From: Jain, Nishit 
Sent: Thursday, November 3, 2016 12:32:48 PM
To: Denny Lee; user@spark.apache.org
Subject: Re: How do I convert a data frame to broadcast variable?

Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how can I only read few 
columns of a table (not whole table) from JDBC as data frame.
This function from data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand if I want to create a JDBCRdd I can specify a select query 
(instead of full table):
new JdbcRDD(sc: SparkContext, getConnection: () ? Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ? T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

May be if I do df.select(col1, col2)  on data frame created via a table, will 
spark be smart enough to fetch only two columns not entire table?
Any way to test this?


From: Denny Lee <denny.g@gmail.com>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <nja...@underarmour.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that way you can join to that table presuming its small 
enough to distributed?  Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit <nja...@underarmour.com> wrote:
I have a lookup table in HANA database. I want to create a spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as an data frame and 
convert data frame into broadcast variable?

Thanks,
Nishit


Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how can I only read few 
columns of a table (not whole table) from JDBC as data frame.
This function from data frame reader does not give an option to read only 
certain columns:
def jdbc(url: String, table: String, predicates: Array[String], 
connectionProperties: Properties): DataFrame

On the other hand if I want to create a JDBCRdd I can specify a select query 
(instead of full table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, 
lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T 
= JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

May be if I do df.select(col1, col2)  on data frame created via a table, will 
spark be smart enough to fetch only two columns not entire table?
Any way to test this?


From: Denny Lee <denny.g@gmail.com>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <nja...@underarmour.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a 
BroadcastHashJoin so that way you can join to that table presuming its small 
enough to distributed?  Here's a handy guide on a BroadcastHashJoin: 
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit <nja...@underarmour.com> wrote:
I have a lookup table in HANA database. I want to create a spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as an data frame and 
convert data frame into broadcast variable?

Thanks,
Nishit


PySpark 2: Kmeans The input data is not directly cached

2016-11-03 Thread Zakaria Hili
Hi,

I dont know why I receive the message

 WARN KMeans: The input data is not directly cached, which may hurt
performance if its parent RDDs are also uncached.

when I try to use Spark Kmeans

df_Part = assembler.transform(df_Part)
df_Part.cache()

while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1

It says that my input (DataFrame) is not cached!

I tried to print df_Part.is_cached and I received True, which means that my
DataFrame is cached, so why is Spark still warning me about this?

thank you in advance




Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a
BroadcastHashJoin so that you can join to that table, presuming it's
small enough to distribute.  Here's a handy guide on a BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html
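
For illustration, a rough Scala sketch of that idea; the HANA JDBC URL, credentials, table and column names below are placeholders:

import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()
import spark.implicits._

val props = new Properties()
props.setProperty("user", "hana_user")
props.setProperty("password", "hana_password")

// small lookup table read over JDBC
val lookup = spark.read.jdbc("jdbc:sap://hana-host:30015", "LOOKUP_TABLE", props)

// some larger fact data to enrich (made up here)
val factDf = Seq((1, "2016-11-03"), (2, "2016-11-02")).toDF("lookup_key", "event_date")

// hint Spark to broadcast the small side so the join avoids a shuffle
val joined = factDf.join(broadcast(lookup), Seq("lookup_key"))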

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit  wrote:

> I have a lookup table in HANA database. I want to create a spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as an data frame
> and convert data frame into broadcast variable?
>
> Thanks,
> Nishit
>


unsubscribe

2016-11-03 Thread Venkatesh Seshan
unsubscribe


How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
I have a lookup table in HANA database. I want to create a spark broadcast 
variable for it.
What would be the suggested approach? Should I read it as an data frame and 
convert data frame into broadcast variable?

Thanks,
Nishit


Unsubscribe

2016-11-03 Thread Bibudh Lahiri
Unsubscribe

-- 
Bibudh Lahiri
Senior Data Scientist, Impetus Technolgoies
720 University Avenue, Suite 130
Los Gatos, CA 95129
http://knowthynumbers.blogspot.com/


Re: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Michael Segel
Hi,

I understand.

With XML, if you know the tag you want to group by, you can use a multi-line 
input format and just advance into the split until you find that tag.
Much more difficult in JSON.




On Nov 3, 2016, at 2:41 AM, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:

I agree this can be a little annoying. The reason this is done this way is to 
enable cases where the json file is huge. To allow splitting it, a separator is 
needed and newline is the separator used (as is done in all text files in 
Hadoop and spark).
I always wondered why support has not been implemented for cases where each 
file is small (e.g. has one object), but the implementation now assumes each line 
has a legal JSON object.

What I do to overcome this is use RDDs (using pyspark):

# get an RDD of the text content. The map is used because wholeTextFiles 
# returns a tuple of (filename, file content)
jsonRDD = sc.wholeTextFiles(filename).map(lambda x: x[1])

# remove whitespace. This can actually be too much, as it would also strip whitespace 
# inside string values, so you may want to remove just the end-of-line characters 
# (e.g. \r, \n)
import re
js = jsonRDD.map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))

# convert the RDD to a dataframe. If you have your own schema, this is where you 
# should add it.
df = spark.read.json(js)

Assaf.

From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Wednesday, November 02, 2016 9:39 PM
To: Daniel Siegmann
Cc: user @spark
Subject: Re: Quirk in how Spark DF handles JSON input records?


On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines 
as a record separator by default. While it is possible to use a different 
string as a record separator, what would you use in the case of JSON?
If you do some Googling I suspect you'll find some possible solutions. 
Personally, I would just use a separate JSON library (e.g. json4s) to parse 
this metadata into an object, rather than trying to read it in through Spark.


Yeah, that’s the basic idea.

This JSON is metadata to help drive the process not row records… although the 
column descriptors are row records so in the short term I could cheat and just 
store those in a file.

:-(


--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001



Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you 
test it can be dangerous if there is nothing in the API that guarantees it. 

Going back quite a few years, it used to be the case that Oracle would always 
return a GROUP BY with the rows in the order of the grouping key. This was the 
result of the implementation specifics of GROUP BY. Then at some point Oracle 
introduced a new hashing GROUP BY mechanism that could be chosen by the 
cost-based optimizer, and all of a sudden lots of people’s applications ‘broke’ 
because they had been relying on functionality that had always worked in the 
past but wasn’t actually guaranteed.

TLDR - don’t rely on functionality that isn’t specified


> On 3 Nov 2016, at 14:37, Koert Kuipers  wrote:
> 
> i did not check the claim in that blog post that the data is ordered, but i 
> wouldnt rely on that behavior since it is not something the api guarantees 
> and could change in future versions
> 
> On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee  > wrote:
> Hi Koert & Robin ,
> 
>   Thanks ! But if you go through the blog 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> 
>  and check the comments under the blog it's actually working, although I am 
> not sure how . And yes I agree a custom aggregate UDAF is a good option . 
> 
> Can anyone share the best way to implement this in Spark .?
> 
> Regards,
> Rabin Banerjee 
> 
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers  > wrote:
> Just realized you only want to keep first element. You can do this without 
> sorting by doing something similar to min or max operation using a custom 
> aggregator/udaf or reduceGroups on Dataset. This is also more efficient.
> 
> 
> On Nov 3, 2016 7:53 AM, "Rabin Banerjee"  > wrote:
> Hi All ,
> 
>   I want to do a dataframe operation to find the rows having the latest 
> timestamp in each group using the below operation 
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
> 
> Spark Version :: 1.6.x
> My Question is "Will Spark guarantee the Order while doing the groupBy , if 
> DF is ordered using OrderBy previously in Spark 1.6.x"??
> 
> I referred a blog here :: 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> 
> Which claims it will work except in Spark 1.5.1 and 1.5.2 .
> 
> I need a bit elaboration of how internally spark handles it ? also is it more 
> efficient than using a Window function ?
> 
> Thanks in Advance ,
> Rabin Banerjee
> 
> 
> 
> 



incomplete aggregation in a GROUP BY

2016-11-03 Thread Donald Matthews
While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a
HiveContext GROUP BY that no longer works reliably.
The GROUP BY results are not always fully aggregated; instead, I get lots
of duplicate + triplicate sets of group values.

I've come up with a workaround that works for me, but the behaviour in
question seems like a Spark bug, and since I don't see anything matching
this in the Spark Jira or on this list, I thought I should check with this
list to see if it's a known issue or if it might be worth creating a ticket
for.

Input:  A single table with 24 fields that I want to group on, and a few
other fields that I want to aggregate.

Statement: similar to hiveContext.sql("""
SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
FROM inputTable
GROUP BY a,b,c, ..., x
""")

Checking the data for one sample run, I see that the input table has about
1.1M rows, with 18157 unique combinations of those 24 grouped values.

Expected output: A table of 18157 rows.

Observed output: A table of 28006 rows. Looking just at unique combinations
of those grouped fields, I see that while 10125 rows are unique as
expected, there are 6215 duplicate rows and 1817 triplicate rows.

This is not quite 100% repeatable. That is, I'll see the issue repeatedly
one day, but the next day with the same input data the GROUP BY will work
correctly.

For now it seems that I have a workaround: if I presort the input table on
those grouped fields, the GROUP BY works correctly. But of course I
shouldn't have to do that.
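
For reference, the presort workaround looks roughly like this (a sketch with the shortened column names from above; inputDF stands in for whatever DataFrame backs inputTable):

// sort on the grouping columns first, then re-register and run the same GROUP BY
val presorted = inputDF.sort("a", "b", "c")  // ... through x
presorted.registerTempTable("inputTable")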

Does this sort of GROUP BY issue seem familiar to anyone?

/drm


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
I did not check the claim in that blog post that the data is ordered, but I
wouldn't rely on that behavior since it is not something the API guarantees
and could change in future versions.

On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee  wrote:

> Hi Koert & Robin ,
>
> *  Thanks ! *But if you go through the blog https://bzhangusc.
> wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ and
> check the comments under the blog it's actually working, although I am not
> sure how . And yes I agree a custom aggregate UDAF is a good option .
>
> Can anyone share the best way to implement this in Spark .?
>
> Regards,
> Rabin Banerjee
>
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers  wrote:
>
>> Just realized you only want to keep first element. You can do this
>> without sorting by doing something similar to min or max operation using a
>> custom aggregator/udaf or reduceGroups on Dataset. This is also more
>> efficient.
>>
>> On Nov 3, 2016 7:53 AM, "Rabin Banerjee" 
>> wrote:
>>
>>> Hi All ,
>>>
>>>   I want to do a dataframe operation to find the rows having the latest
>>> timestamp in each group using the below operation
>>>
>>> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
>>> .select("customername","service_type","mobileno","cust_addr")
>>>
>>>
>>> *Spark Version :: 1.6.x*
>>>
>>> My Question is *"Will Spark guarantee the Order while doing the groupBy , 
>>> if DF is ordered using OrderBy previously in Spark 1.6.x"??*
>>>
>>>
>>> *I referred a blog here :: 
>>> **https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>>>  
>>> *
>>>
>>> *Which claims it will work except in Spark 1.5.1 and 1.5.2 .*
>>>
>>>
>>> *I need a bit elaboration of how internally spark handles it ? also is it 
>>> more efficient than using a Window function ?*
>>>
>>>
>>> *Thanks in Advance ,*
>>>
>>> *Rabin Banerjee*
>>>
>>>
>>>
>>>
>


Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Hi,

I ran some tests regarding Spark's Delegation Token renewal mechanism. As I
see, the concept here is simple: if I give my keytab file and client
principal to Spark, it starts a token renewal thread, and renews the
namenode delegation tokens after some time. This works fine.

Then I tried to run a long application (with HDFS operation in the end)
without providing the keytab/principal to Spark, and I expected it to fail
after the token expires. It turned out that this is not the case, the
application finishes successfully without a delegation token renewal by
Spark.

My question is: how is that possible? Shouldn't a saveAsTextfile() fail
after the namenode delegation token expired?

Regards,
Zsolt


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread ayan guha
I would go for partition by option. It seems simple and yes, SQL inspired
:)
On 4 Nov 2016 00:59, "Rabin Banerjee"  wrote:

> Hi Koert & Robin ,
>
> *  Thanks ! *But if you go through the blog https://bzhangusc.
> wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ and
> check the comments under the blog it's actually working, although I am not
> sure how . And yes I agree a custom aggregate UDAF is a good option .
>
> Can anyone share the best way to implement this in Spark .?
>
> Regards,
> Rabin Banerjee
>
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers  wrote:
>
>> Just realized you only want to keep first element. You can do this
>> without sorting by doing something similar to min or max operation using a
>> custom aggregator/udaf or reduceGroups on Dataset. This is also more
>> efficient.
>>
>> On Nov 3, 2016 7:53 AM, "Rabin Banerjee" 
>> wrote:
>>
>>> Hi All ,
>>>
>>>   I want to do a dataframe operation to find the rows having the latest
>>> timestamp in each group using the below operation
>>>
>>> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
>>> .select("customername","service_type","mobileno","cust_addr")
>>>
>>>
>>> *Spark Version :: 1.6.x*
>>>
>>> My Question is *"Will Spark guarantee the Order while doing the groupBy , 
>>> if DF is ordered using OrderBy previously in Spark 1.6.x"??*
>>>
>>>
>>> *I referred a blog here :: 
>>> **https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>>>  
>>> *
>>>
>>> *Which claims it will work except in Spark 1.5.1 and 1.5.2 .*
>>>
>>>
>>> *I need a bit elaboration of how internally spark handles it ? also is it 
>>> more efficient than using a Window function ?*
>>>
>>>
>>> *Thanks in Advance ,*
>>>
>>> *Rabin Banerjee*
>>>
>>>
>>>
>>>
>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi Koert & Robin ,

*Thanks!* But if you go through the blog
https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
and check the comments under the blog, it's actually working, although I am not
sure how. And yes, I agree a custom aggregate UDAF is a good option.

Can anyone share the best way to implement this in Spark?

Regards,
Rabin Banerjee

On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers  wrote:

> Just realized you only want to keep first element. You can do this without
> sorting by doing something similar to min or max operation using a custom
> aggregator/udaf or reduceGroups on Dataset. This is also more efficient.
>
> On Nov 3, 2016 7:53 AM, "Rabin Banerjee" 
> wrote:
>
>> Hi All ,
>>
>>   I want to do a dataframe operation to find the rows having the latest
>> timestamp in each group using the below operation
>>
>> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
>> .select("customername","service_type","mobileno","cust_addr")
>>
>>
>> *Spark Version :: 1.6.x*
>>
>> My Question is *"Will Spark guarantee the Order while doing the groupBy , if 
>> DF is ordered using OrderBy previously in Spark 1.6.x"??*
>>
>>
>> *I referred a blog here :: 
>> **https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>>  
>> *
>>
>> *Which claims it will work except in Spark 1.5.1 and 1.5.2 .*
>>
>>
>> *I need a bit elaboration of how internally spark handles it ? also is it 
>> more efficient than using a Window function ?*
>>
>>
>> *Thanks in Advance ,*
>>
>> *Rabin Banerjee*
>>
>>
>>
>>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Just realized you only want to keep first element. You can do this without
sorting by doing something similar to min or max operation using a custom
aggregator/udaf or reduceGroups on Dataset. This is also more efficient.
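
As an aside, one way to get that min/max-style behaviour without a sort or a full UDAF is a max over a struct; this is only a sketch using the column names from the original post, and it assumes a reasonably recent Spark version and a DataFrame df shaped like the one in the question. Structs compare field by field, so putting transaction_date first makes max pick the latest row per key:

import org.apache.spark.sql.functions.{col, max, struct}

val latest = df
  .groupBy("mobileno")
  .agg(max(struct(col("transaction_date"), col("customername"),
                  col("service_type"), col("cust_addr"))).as("latest"))
  .select(col("mobileno"),
          col("latest.customername").as("customername"),
          col("latest.service_type").as("service_type"),
          col("latest.cust_addr").as("cust_abbr"))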

On Nov 3, 2016 7:53 AM, "Rabin Banerjee" 
wrote:

> Hi All ,
>
>   I want to do a dataframe operation to find the rows having the latest
> timestamp in each group using the below operation
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
>
>
> *Spark Version :: 1.6.x*
>
> My Question is *"Will Spark guarantee the Order while doing the groupBy , if 
> DF is ordered using OrderBy previously in Spark 1.6.x"??*
>
>
> *I referred a blog here :: 
> **https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> *
>
> *Which claims it will work except in Spark 1.5.1 and 1.5.2 .*
>
>
> *I need a bit elaboration of how internally spark handles it ? also is it 
> more efficient than using a Window function ?*
>
>
> *Thanks in Advance ,*
>
> *Rabin Banerjee*
>
>
>
>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
What you require is secondary sort which is not available as such for a
DataFrame. The Window operator is what comes closest but it is strangely
limited in its abilities (probably because it was inspired by a SQL
construct instead of a more generic programmatic transformation capability).

On Nov 3, 2016 7:53 AM, "Rabin Banerjee" 
wrote:

> Hi All ,
>
>   I want to do a dataframe operation to find the rows having the latest
> timestamp in each group using the below operation
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
>
>
> *Spark Version :: 1.6.x*
>
> My Question is *"Will Spark guarantee the Order while doing the groupBy , if 
> DF is ordered using OrderBy previously in Spark 1.6.x"??*
>
>
> *I referred a blog here :: 
> **https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> *
>
> *Which claims it will work except in Spark 1.5.1 and 1.5.2 .*
>
>
> *I need a bit elaboration of how internally spark handles it ? also is it 
> more efficient than using a Window function ?*
>
>
> *Thanks in Advance ,*
>
> *Rabin Banerjee*
>
>
>
>


Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever 
the implementation details or the observed behaviour. I would use a Window 
operation and order within the group.
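
For illustration, a sketch of the Window approach using the column names from the original post (in 1.6 the DataFrame window functions need a HiveContext):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

val latest = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .select("customername", "service_type", "mobileno", "cust_addr")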




> On 3 Nov 2016, at 11:53, Rabin Banerjee  wrote:
> 
> Hi All ,
> 
>   I want to do a dataframe operation to find the rows having the latest 
> timestamp in each group using the below operation 
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
> 
> Spark Version :: 1.6.x
> My Question is "Will Spark guarantee the Order while doing the groupBy , if 
> DF is ordered using OrderBy previously in Spark 1.6.x"??
> 
> I referred a blog here :: 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> 
> Which claims it will work except in Spark 1.5.1 and 1.5.2 .
> 
> I need a bit elaboration of how internally spark handles it ? also is it more 
> efficient than using a Window function ?
> 
> Thanks in Advance ,
> Rabin Banerjee
> 
> 



Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi All ,

  I want to do a dataframe operation to find the rows having the latest
timestamp in each group using the below operation

df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
.select("customername","service_type","mobileno","cust_addr")


*Spark Version :: 1.6.x*

My Question is *"Will Spark guarantee the Order while doing the
groupBy , if DF is ordered using OrderBy previously in Spark 1.6.x"??*


*I referred a blog here ::
**https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
*

*Which claims it will work except in Spark 1.5.1 and 1.5.2 .*


*I need a bit elaboration of how internally spark handles it ? also is
it more efficient than using a Window function ?*


*Thanks in Advance ,*

*Rabin Banerjee*


Re: How to return a case class in map function?

2016-11-03 Thread Yan Facai
2.0.1 has fixed the bug.
Thanks very much.

On Thu, Nov 3, 2016 at 6:22 PM, 颜发才(Yan Facai)  wrote:

> Thanks, Armbrust.
> I'm using 2.0.0.
> Does 2.0.1 stable version fix it?
>
> On Thu, Nov 3, 2016 at 2:01 AM, Michael Armbrust 
> wrote:
>
>> Thats a bug.  Which version of Spark are you running?  Have you tried
>> 2.0.2?
>>
>> On Wed, Nov 2, 2016 at 12:01 AM, 颜发才(Yan Facai)  wrote:
>>
>>> Hi, all.
>>> When I use a case class as return value in map function, spark always
>>> raise a ClassCastException.
>>>
>>> I write an demo, like:
>>>
>>> scala> case class Record(key: Int, value: String)
>>>
>>> scala> case class ID(key: Int)
>>>
>>> scala> val df = Seq(Record(1, "a"), Record(2, "b")).toDF
>>>
>>> scala> df.map{x => ID(x.getInt(0))}.show
>>>
>>> 16/11/02 14:52:34 ERROR Executor: Exception in task 0.0 in stage 166.0
>>> (TID 175)
>>> java.lang.ClassCastException: $line1401.$read$$iw$$iw$ID cannot be cast
>>> to $line1401.$read$$iw$$iw$ID
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.processNext(Unknown Source)
>>>
>>>
>>> Please tell me if I'm wrong.
>>> Thanks.
>>>
>>>
>>
>>
>


LinearRegressionWithSGD and Rank Features By Importance

2016-11-03 Thread Carlo . Allocca
Hi All,

I am using Spark, and in particular the MLlib library.

import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

For my problem I am using the LinearRegressionWithSGD and I would like to 
perform a “Rank Features By Importance”.

I checked the documentation and it seems that it does not provide such methods.

Am I missing anything?  Please, could you provide any help on this?
Should I change the approach?

Many Thanks in advance,

Best Regards,
Carlo


-- The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England & Wales and a charity registered in Scotland (SC 038302). 
The Open University is authorised and regulated by the Financial Conduct 
Authority.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to return a case class in map function?

2016-11-03 Thread Yan Facai
Thanks, Armbrust.
I'm using 2.0.0.
Does 2.0.1 stable version fix it?

On Thu, Nov 3, 2016 at 2:01 AM, Michael Armbrust 
wrote:

> Thats a bug.  Which version of Spark are you running?  Have you tried
> 2.0.2?
>
> On Wed, Nov 2, 2016 at 12:01 AM, 颜发才(Yan Facai)  wrote:
>
>> Hi, all.
>> When I use a case class as return value in map function, spark always
>> raise a ClassCastException.
>>
>> I write an demo, like:
>>
>> scala> case class Record(key: Int, value: String)
>>
>> scala> case class ID(key: Int)
>>
>> scala> val df = Seq(Record(1, "a"), Record(2, "b")).toDF
>>
>> scala> df.map{x => ID(x.getInt(0))}.show
>>
>> 16/11/02 14:52:34 ERROR Executor: Exception in task 0.0 in stage 166.0
>> (TID 175)
>> java.lang.ClassCastException: $line1401.$read$$iw$$iw$ID cannot be cast
>> to $line1401.$read$$iw$$iw$ID
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>> eratedIterator.processNext(Unknown Source)
>>
>>
>> Please tell me if I'm wrong.
>> Thanks.
>>
>>
>
>


SparkSQL with Hive got "java.lang.NullPointerException"

2016-11-03 Thread lxw
Hi, experts:
 
   I use SparkSQL to query Hive tables. This query throws an NPE, but runs OK with 
Hive.


SELECT city
FROM (
  SELECT city 
  FROM t_ad_fact a 
  WHERE a.pt = '2016-10-10' 
  limit 100
) x 
GROUP BY city;



Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:912)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:911)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:310)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:129)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:128)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.a

How to join dstream and JDBCRDD with checkpointing enabled

2016-11-03 Thread saurabh3d
Hi All,

We have a Spark Streaming job with checkpointing enabled. It executes correctly
the first time, but throws the exception below when restarted from the checkpoint.

org.apache.spark.SparkException: RDD transformations and actions can only be
invoked by the driver, not inside of other transformations; for example,
rdd1.map(x => rdd2.values.count() * x) is invalid because the values
transformation and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
at org.apache.spark.rdd.RDD.union(RDD.scala:565)
at
org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
at
org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)

Please suggest any workaround for this issue. 

Code:
String URL = "jdbc:oracle:thin:" + USERNAME + "/" + PWD + "@//" +
CONNECTION_STRING;

Map options = ImmutableMap.of(
"driver", "oracle.jdbc.driver.OracleDriver",
"url", URL,
"dbtable", "READINGS_10K",
"fetchSize", "1");

DataFrame OracleDB_DF = sqlContext.load("jdbc", options);
JavaPairRDD OracleDB_RDD = OracleDB_DF.toJavaRDD()
.mapToPair(x -> new Tuple2(x.getString(0), x));

Dstream
.transformToPair(
rdd -> rdd
.mapToPair(
record ->
new Tuple2<>(
   
record.getKey().toString(),
record))
.join(OracleDB_RDD))
.print();

Spark version 1.6, running in yarn cluster mode.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-dstream-and-JDBCRDD-with-checkpointing-enabled-tp28001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Mendelson, Assaf
I agree this can be a little annoying. The reason this is done this way is to 
enable cases where the json file is huge. To allow splitting it, a separator is 
needed and newline is the separator used (as is done in all text files in 
Hadoop and spark).
I always wondered why support has not been implemented for cases where each 
file is small (e.g. has one object), but the implementation now assumes each line 
has a legal JSON object.

What I do to overcome this is use RDDs (using pyspark):

# get an RDD of the text content. The map is used because wholeTextFiles 
# returns a tuple of (filename, file content)
jsonRDD = sc.wholeTextFiles(filename).map(lambda x: x[1])

# remove whitespace. This can actually be too much, as it would also strip whitespace 
# inside string values, so you may want to remove just the end-of-line characters 
# (e.g. \r, \n)
import re
js = jsonRDD.map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))

# convert the RDD to a dataframe. If you have your own schema, this is where you 
# should add it.
df = spark.read.json(js)

Assaf.

From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Wednesday, November 02, 2016 9:39 PM
To: Daniel Siegmann
Cc: user @spark
Subject: Re: Quirk in how Spark DF handles JSON input records?


On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines 
as a record separator by default. While it is possible to use a different 
string as a record separator, what would you use in the case of JSON?
If you do some Googling I suspect you'll find some possible solutions. 
Personally, I would just use a separate JSON library (e.g. json4s) to parse 
this metadata into an object, rather than trying to read it in through Spark.


Yeah, that’s the basic idea.

This JSON is metadata to help drive the process not row records… although the 
column descriptors are row records so in the short term I could cheat and just 
store those in a file.

:-(


--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001



RE: Use a specific partition of dataframe

2016-11-03 Thread Mendelson, Assaf
There are a couple of tools you can use. Take a look at the various functions.
Specifically, limit might be useful for you and sample/sampleBy functions can 
make your data smaller.
Actually, when using CreateDataframe you can sample the data to begin with.

Specifically working by partitions can be done by moving through the RDD 
interface but I am not sure this is what you want. Actually working through a 
specific partition might mean seeing skewed data because of the hashing method 
used to partition (this would of course depend on how your dataframe was 
created).

Just to get smaller data sample/sampleBy seems like the best solution to me.
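
Roughly, using the tmp DataFrame from the original mail (the 10% fraction is arbitrary):

// approximate the per-"var" counts on a 10% sample instead of the full data
val approx = tmp.sample(false, 0.1, 42L)
approx.groupBy("var").count().show()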

Assaf.

From: Yanwei Zhang [mailto:actuary_zh...@hotmail.com]
Sent: Wednesday, November 02, 2016 6:29 PM
To: user
Subject: Use a specific partition of dataframe

Is it possible to retrieve a specific partition  (e.g., the first partition) of 
a DataFrame and apply some function there? My data is too large, and I just 
want to get some approximate measures using the first few partitions in the 
data. I'll illustrate what I want to accomplish using the example below:

// create data
val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1),
  ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", 
"value")
// I want to get the first partition only, and do some calculation, for 
example, count by the value of "var"
tmp1 = tmp.getPartition(0)
tmp1.groupBy("var").count()

The idea is not to go through all the data to save computational time. So I am 
not sure whether mapPartitionsWithIndex is helpful in this case, since it still 
maps all data.

Regards,
Wayne




Is Spark launcher's listener API considered production ready?

2016-11-03 Thread Aseem Bansal
While using Spark launcher's listener API, we came across a few cases where the
failures were not being reported correctly.


   - https://issues.apache.org/jira/browse/SPARK-17742
   - https://issues.apache.org/jira/browse/SPARK-18241

So I just wanted to check whether this API is considered production ready, and
whether anyone has used it in production successfully. It seems
that we need to handle the failures ourselves. How did you handle the
failure cases?