Re: RV: Unintelligible warning arose out of the blue.

2018-05-04 Thread Marco Mistroni
Hi,
  I think it has to do with the Spark configuration; I don't think the standard
configuration is geared towards running in local mode on Windows.
Your dataframe is fine. You can confirm that it was read successfully by
printing df.count() and inspecting the output.
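
For example, from the pyspark shell (a quick sketch; 'mvce.csv' is the file
from your message, and 'spark' is the session the shell predefines):

df = spark.read.csv('mvce.csv')
print(df.count())   # should print 3 for a three-line file
df.show()           # inspect the parsed rows and columns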

hth
m

On Fri, May 4, 2018 at 9:15 PM, Tomas Zubiri  wrote:

>
>
>
> --
> *From:* Tomas Zubiri
> *Sent:* Friday, May 4, 2018 4:23 PM
> *To:* user@spark.apache.org
> *Subject:* Unintelligible warning arose out of the blue.
>
>
> My setup is as follows:
> Windows 10
> Python 3.6.5
> Spark 2.3.0
> The latest Java JDK
> winutils/hadoop installed from this GitHub page:
> https://github.com/steveloughran/winutils
>
> I initialize spark from the pyspark shell as follows:
> df = spark.read.csv('mvce.csv')
>
>
> The mvce is a small file with 3 lines and 2 columns.
>
>
> The warning I receive is:
>
> 2018-05-04 16:17:44 WARN  ObjectStore:568 - Failed to get database
> global_temp, returning NoSuchObjectException
>
> What does this mean? I think it has something to do with YARN, but since
> that is an internal technology I have no clue about it. It doesn't seem to
> be causing any trouble, but I don't want the uncertainty that it might be
> causing a bug when diagnosing issues in the future.
>
> Thank you for your help!
>
>
>


RV: Unintelligible warning arose out of the blue.

2018-05-04 Thread Tomas Zubiri




From: Tomas Zubiri
Sent: Friday, May 4, 2018 4:23 PM
To: user@spark.apache.org
Subject: Unintelligible warning arose out of the blue.


My setup is as follows:
Windows 10
Python 3.6.5
Spark 2.3.0
The latest Java JDK
winutils/hadoop installed from this GitHub page:
https://github.com/steveloughran/winutils

I initialize spark from the pyspark shell as follows:
df = spark.read.csv('mvce.csv')


The mvce is a small file with 3 lines and 2 columns.


The warning I receive is:

2018-05-04 16:17:44 WARN  ObjectStore:568 - Failed to get database global_temp, 
returning NoSuchObjectException


What does this mean? I think it has something to do with YARN, but since that 
is an internal technology I have no clue about it. It doesn't seem to be 
causing any trouble, but I don't want the uncertainty that it might be causing 
a bug when diagnosing issues in the future.

Thank you for your help!




Re: [pyspark] Read multiple files parallely into a single dataframe

2018-05-04 Thread Irving Duran
I could be wrong, but I think you can use a wildcard.

df = spark.read.format('csv').load('/path/to/file*.csv.gz')

Thank You,

Irving Duran


On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
>
> I want to read multiple files in parallel into 1 dataframe. But the files
> have random names and cannot conform to any pattern (so I can't use a
> wildcard). Also, the files can be in different directories.
> If I provide the file names in a list to the dataframe reader, it reads
> them sequentially.
> Eg:
> df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz'])
> This reads the files sequentially. What can I do to read the files in
> parallel?
> I noticed that Spark reads files in parallel if it is given a directory
> location directly. How can that be extended to multiple random files?
> If my system has 4 cores, how can I make Spark read 4 files at a
> time?
>
> Please suggest.
>


Re: Free Column Reference with $

2018-05-04 Thread Vadim Semenov
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala#L38-L47

It's called string interpolation. See the "Advanced Usage" section here:
https://docs.scala-lang.org/overviews/core/string-interpolation.html
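
For example (a minimal sketch; the DataFrame and column names are made up):

// The implicit StringToColumn class linked above extends StringContext with a
// custom "$" interpolator, so $"age" builds a ColumnName (which is a Column).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("dollar-example").getOrCreate()
import spark.implicits._   // brings StringToColumn (and therefore $) into scope

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
df.select($"name", $"age" + 1).show()   // $"age" behaves like col("age")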

On Fri, May 4, 2018 at 10:10 AM, Christopher Piggott 
wrote:

> How does $"something" actually work (from a Scala perspective) as a free
> column reference?
>
>


-- 
Sent from my iPhone


Free Column Reference with $

2018-05-04 Thread Christopher Piggott
How does $"something" actually work (from a Scala perspective) as a free
column reference?


Re: AccumulatorV2 vs AccumulableParam (V1)

2018-05-04 Thread Sergey Zhemzhitsky
Hi Wenchen,

Thanks a lot for the clarification and help.

Here is what I mean regarding the remaining points:

For 2: Should we update the documentation [1] on custom accumulators to make
it clearer and to highlight that
  a) custom accumulators should always override the "copy" method to
prevent unexpected behaviour from losing type information,
  b) custom accumulators cannot be direct anonymous subclasses of
AccumulatorV2 because of a), and
  c) extending already existing accumulators almost always requires
overriding "copy" because of a)? (A small sketch of c) follows below.)

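For illustration, a minimal sketch of c) (the subclass name
"NamedLongAccumulator" is hypothetical, not an existing Spark class):

import org.apache.spark.util.LongAccumulator

// A named subclass of an existing accumulator that overrides "copy" to return
// its own concrete type. Without this override, LongAccumulator.copy() returns
// a plain LongAccumulator, and the subclass type (plus any extra state such as
// "label") is lost when Spark copies the accumulator before sending it to the
// executors.
class NamedLongAccumulator(val label: String) extends LongAccumulator {
  override def copy(): NamedLongAccumulator = {
    val acc = new NamedLongAccumulator(label)
    acc.merge(this)   // carry over the current sum and count
    acc
  }
}
// Registered as usual, e.g. spark.sparkContext.register(acc, acc.label)
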
For 3: Here is [2] a sample showing that the same AccumulableParam can be
registered twice with different names.
Here is [3] a sample that fails with IllegalStateException on this line [4],
because the accumulator's metadata is not null and it is hardly possible to
reset it to null (there is no public API for such a thing).
I understand that Spark creates different accumulators for the same
AccumulableParam internally, and because AccumulatorV2 is stateful, using the
same stateful accumulator instance in multiple places for different things is
very dangerous, so maybe we should highlight this point in the documentation
too?

For 5: Should we raise a JIRA for that?


[1] https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
[2] 
https://gist.github.com/szhem/52a26ada4bbeb1a3e762710adc3f94ef#file-accumulatorsspec-scala-L36
[3] 
https://gist.github.com/szhem/52a26ada4bbeb1a3e762710adc3f94ef#file-accumulatorsspec-scala-L59
[4] 
https://github.com/apache/spark/blob/4d5de4d303a773b1c18c350072344bd7efca9fc4/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L51


Kind Regards,
Sergey

On Thu, May 3, 2018 at 5:20 PM, Wenchen Fan  wrote:
> Hi Sergey,
>
> Thanks for your valuable feedback!
>
> For 1: yea this is definitely a bug and I have sent a PR to fix it.
> For 2: I have left my comments on the JIRA ticket.
> For 3: I don't quite understand it, can you give some concrete examples?
> For 4: yea this is a problem, but I think it's not a big deal, and we
> couldn't find a better solution at that time.
> For 5: I think this is a real problem. It looks to me that we can merge
> `isZero`, `copyAndReset`, `copy`, `reset` into one API: `zero`, which is
> basically just the `copyAndReset`. If there is a way to fix this without
> breaking the existing API, I'm really happy to do it.
> For 6: same as 4. It's a problem but not a big deal.
>
> In general, I think accumulator v2 sacrifices some flexibility to simplify
> the framework and improve the performance. Users can still use accumulator
> v1 if flexibility is more important to them. We can keep improving
> accumulator v2 without breaking backward compatibility.
>
> Thanks,
> Wenchen
>
> On Thu, May 3, 2018 at 6:20 AM, Sergey Zhemzhitsky 
> wrote:
>>
>> Hello guys,
>>
>> I've started to migrate my Spark jobs which use Accumulators V1 to
>> AccumulatorV2 and faced with the following issues:
>>
>> 1. LegacyAccumulatorWrapper now requires the resulting type of
>> AccumulableParam to implement equals. Otherwise the AccumulableParam,
>> automatically wrapped into LegacyAccumulatorWrapper, will fail with an
>> AssertionError (SPARK-23697 [1]).
>>
>> 2. Existing AccumulatorV2 classes are difficult to extend easily and
>> correctly (SPARK-24154 [2]) due to the "copy" method, which is called
>> during serialization and usually loses the type information of descendant
>> classes that don't override "copy" (it's easier to implement an
>> accumulator from scratch than to override it correctly).
>>
>> 3. The same instance of AccumulatorV2 cannot be used with the same
>> SparkContext multiple times (unlike AccumulableParam), failing with
>> "IllegalStateException: Cannot register an Accumulator twice" even
>> after the "reset" method is called. So it's impossible to unregister an
>> already registered accumulator from user code.
>>
>> 4. AccumulableParam (V1) implementations are usually more or less
>> stateless, while AccumulatorV2 implementations are almost always
>> stateful, leading to (unnecessary?) type checks (unlike
>> AccumulableParam). For example, a typical "merge" method of AccumulatorV2
>> has to check whether the accumulator passed to it is of an appropriate
>> type, like here [3].
>>
>> 5. AccumulatorV2 is more difficult to implement correctly than
>> AccumulableParam. For example, in case of AccumulableParam I have to
>> implement just 3 methods (addAccumulator, addInPlace, zero), in case of
>> AccumulatorParam just 2 methods (addInPlace, zero), and in case of
>> AccumulatorV2 6 methods (isZero, copy, reset, add, merge, value).
>>
>> 6. AccumulatorV2 classes can hardly be anonymous classes, because their
>> "copy" and "merge" methods typically require a concrete class to make a
>> type check.
>>
>> I understand the motivation for AccumulatorV2 (SPARK-14654 [4]), but
>> just wondering whether there is a way to simplify the API of
>> AccumulatorV2 to 

[pyspark] Read multiple files parallely into a single dataframe

2018-05-04 Thread Shuporno Choudhury
Hi,

I want to read multiple files in parallel into 1 dataframe. But the files
have random names and cannot conform to any pattern (so I can't use a
wildcard). Also, the files can be in different directories.
If I provide the file names in a list to the dataframe reader, it reads
them sequentially.
Eg:
df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz'])
This reads the files sequentially. What can I do to read the files in
parallel?
I noticed that Spark reads files in parallel if it is given a directory
location directly. How can that be extended to multiple random files?
If my system has 4 cores, how can I make Spark read 4 files at a time?

Please suggest.
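
One possible workaround (a sketch only, not something proposed in the thread)
is to build one dataframe per path and union them, so that Spark can schedule
the partitions of all files together; the paths and pool size below are
assumptions for illustration.

from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("parallel-read-sketch").getOrCreate()

# Hypothetical list of randomly named files, possibly in different directories.
paths = ['/path/to/file1.csv.gz', '/data/other/abc123.csv.gz',
         '/data/misc/xyz789.csv.gz', '/path/to/file4.csv.gz']

def read_one(path):
    return spark.read.format('csv').load(path)

# Set up the per-file reads from a small driver-side thread pool...
with ThreadPoolExecutor(max_workers=4) as pool:   # e.g. one thread per core
    dfs = list(pool.map(read_one, paths))

# ...and combine them into a single dataframe (the schemas must match).
df = reduce(DataFrame.union, dfs)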


Re: Pickling Keras models for use in UDFs

2018-05-04 Thread Khaled Zaouk
Why don't you try to encapsulate your Keras model within a wrapper class (an
estimator, let's say) and implement the two functions __getstate__ and
__setstate__ inside this wrapper class?
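
A minimal sketch of such a wrapper (the class name and the HDF5-bytes
round-trip through a temp file are assumptions for illustration):

import os
import tempfile

class PicklableKerasModel:
    def __init__(self, model):
        self.model = model

    def __getstate__(self):
        # Serialize the wrapped model to HDF5 and carry the raw bytes.
        fd, path = tempfile.mkstemp(suffix=".h5")
        os.close(fd)
        try:
            self.model.save(path)
            with open(path, "rb") as f:
                model_bytes = f.read()
        finally:
            os.remove(path)
        return {"model_bytes": model_bytes}

    def __setstate__(self, state):
        # Rebuild the model from the carried bytes on the worker side.
        from keras.models import load_model
        fd, path = tempfile.mkstemp(suffix=".h5")
        os.close(fd)
        try:
            with open(path, "wb") as f:
                f.write(state["model_bytes"])
            self.model = load_model(path)
        finally:
            os.remove(path)

    def predict(self, features):
        return self.model.predict(features)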

On Thu, May 3, 2018 at 5:27 PM erp12  wrote:

> I would like to create a Spark UDF which returns a prediction made with a
> trained Keras model. Keras models are not typically pickle-able, however I
> have used the monkey patch approach to making Keras models pickle-able, as
> described here: http://zachmoshe.com/2017/04/03/pickling-keras-models.html
>
> This allows for models to be sent from the PySpark driver to the workers,
> however the worker python processes do not have the monkey patched Model
> class, and thus cannot properly un-pickle the models. To fix this issue, I
> know I must call the monkey patching function (make_keras_picklable()) once
> on each worker, however I have been unable to figure out how to do this.
>
> I am curious to hear if anyone has a fix for this issue, or would like to
> offer an alternative way to make predictions with a Keras model within a
> Spark UDF.
>
> Here is a Stack Overflow question with more details:
>
> https://stackoverflow.com/questions/50007126/pickling-monkey-patched-keras-model-for-use-in-pyspark
>
> Thank you!
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>
>


I cannot use spark 2.3.0 and kafka 0.9?

2018-05-04 Thread kant kodali
Hi All,

This link seems to suggest I can't use Spark 2.3.0 with a Kafka 0.9 broker.
Is that correct?

https://spark.apache.org/docs/latest/streaming-kafka-integration.html

Thanks!


SparkContext taking time after adding jars and asking yarn for resources

2018-05-04 Thread neeravsalaria
In my production setup Spark always takes 40 seconds between these steps, as
if a fixed timer were set. In my local lab the same steps take exactly 1
second. I am not able to find the exact root cause of this behaviour. My
Spark application runs on the Hortonworks platform in YARN client mode.
Can someone explain what is happening between these steps?

18/05/04 *07:56:45* INFO spark.SparkContext: Added JAR
file:/app/jobs/jobs.jar at spark://10.233.69.5:37668/jars/jobs.jar with
timestamp 1525420605369
18/05/04 *07:57:26* WARN shortcircuit.DomainSocketFactory: The short-circuit
local reads feature cannot be used because libhadoop cannot be loaded.
18/05/04 *07:58:12* INFO client.AHSProxy: Connecting to Application History
server at gsidev001-mgt-01.thales.fr/192.168.1.11:10200



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-04 Thread kant kodali
1) I get an error when I set the watermark to 0.
2) I set the window and slide interval to 1 second with no watermark. It still
aggregates messages from the previous batch that fall within the 1-second
window.

So is it fair to say there is no declarative way to do stateless
aggregations?


On Thu, May 3, 2018 at 9:55 AM, Arun Mahadevan  wrote:

> I think you need to group by a window (tumbling) and define watermarks
> (put a very low watermark or even 0) to discard the state. Here the window
> duration becomes your logical batch.
>
> - Arun
>
> From: kant kodali 
> Date: Thursday, May 3, 2018 at 1:52 AM
> To: "user @spark" 
> Subject: Re: question on collect_list or say aggregations in general in
> structured streaming 2.3.0
>
> After doing some more research using Google, it's clear that aggregations
> are stateful by default in Structured Streaming. So the question now is how
> to do stateless aggregations (not storing the result from previous batches)
> using Structured Streaming 2.3.0. I am trying to do it using raw Spark SQL,
> so not using flatMapGroupsWithState. And if that is not available, then is
> it fair to say there is no declarative way to do stateless aggregations?
>
> On Thu, May 3, 2018 at 1:24 AM, kant kodali  wrote:
>
>> Hi All,
>>
>> I was under the assumption that one needs to run groupBy(window(...)) to
>> run any stateful operations, but it looks like that is not the case, since
>> any aggregation query like
>> "select count(*) from some_view" is also stateful: it stores the
>> result of the count from the previous batch. Likewise, if I do
>>
>> "select collect_list(*) from some_view" with say maxOffsetsTrigger set to
>> 1 I can see the rows from the previous batch at every trigger.
>>
>> so is it fair to say aggregations by default are stateful?
>>
>> I am looking for more of a DStream-like (stateless) approach, where I want
>> to collect a bunch of records in each batch, do some aggregation (say, a
>> count), emit the result, and in the next batch count only the records from
>> that batch, not from the previous batch.
>>
>> So if I run "select collect_list(*) from some_view", I want to collect
>> whatever rows are available at each batch/trigger, but not from the previous
>> batch. How do I do that?
>>
>> Thanks!
>>
>
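
For reference, a minimal sketch of the tumbling-window approach described in
the quoted reply, using the built-in rate source (the source, column names and
10-second values are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("tumbling-window-sketch").getOrCreate()

# The rate source emits (timestamp, value) rows and stands in for a real source.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 5)
          .load())

# Group by a tumbling window and keep the watermark very low, so the state for
# a window is dropped soon after the window closes; the window duration then
# acts as the logical batch.
counts = (events
          .withWatermark("timestamp", "10 seconds")
          .groupBy(window("timestamp", "10 seconds"))
          .agg(count("*").alias("cnt")))

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination()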
>