Re: unsubscribe

2016-09-27 Thread Daniel Lopes
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | http://www.daniellopes.com.br

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Mon, Sep 26, 2016 at 12:24 PM, Karthikeyan Vasuki Balasubramaniam <
kvasu...@eng.ucsd.edu> wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unsubscribe

2016-09-14 Thread Daniel Lopes
Hi Chang,

just send an e-mail to user-unsubscr...@spark.apache.org

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Tue, Sep 13, 2016 at 12:38 AM, ChangMingMin(常明敏) <
chang_ming...@founder.com> wrote:

> unsubscribe
>


Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-13 Thread Daniel Lopes
Hi Mario,

Thanks for your help, so I will keep using CSVs.

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Mon, Sep 12, 2016 at 3:39 PM, Mario Ds Briggs <mario.bri...@in.ibm.com>
wrote:

> Daniel,
>
> I believe it is related to
> https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a
> task fails in an executor (probably for some other reason you hit the
> latter in parquet and not csv).
>
> The PR in there, should be shortly available in IBM's Analytics for Spark.
>
>
> thanks
> Mario
>
>
> From: Adam Roberts/UK/IBM
> To: Mario Ds Briggs/India/IBM@IBMIN
> Date: 12/09/2016 09:37 pm
> Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix
> --
>
>
> Mario, in case you've not seen this...
>
> --
> *Adam Roberts*
> IBM Spark Team Lead
> Runtime Technologies - Hursley
> - Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -
>
> From: Daniel Lopes <dan...@onematch.com.br>
> To: Steve Loughran <ste...@hortonworks.com>
> Cc: user <user@spark.apache.org>
> Date: 12/09/2016 13:05
> Subject: Re: Spark + Parquet + IBM Block Storage at Bluemix
> --
>
>
>
> Thanks Steve,
>
> But this error occurs only with parquet files; CSVs work.
>
> Best,
>
> *Daniel Lopes*
> Chief Data and Analytics Officer | OneMatch
> c: +55 (18) 99764-2733 | *https://www.linkedin.com/in/dslopes*
> <https://www.linkedin.com/in/dslopes>
>
> *www.onematch.com.br*
> <http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
>
> On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <*ste...@hortonworks.com*
> <ste...@hortonworks.com>> wrote:
>
>On 9 Sep 2016, at 17:56, Daniel Lopes <*dan...@onematch.com.br*
>  <dan...@onematch.com.br>> wrote:
>
>  Hi, can someone help?
>
>  I'm trying to use parquet with IBM Block Storage in Spark, but when
>  I try to load I get this error:
>
>  using this config
>
>  credentials = {
>    "name": "keystone",
>    "auth_url": "https://identity.open.softlayer.com",
>    "project": "object_storage_23f274c1_d11XXXe634",
>    "projectId": "XXd9c4aa39b7c7eb",
>    "region": "dallas",
>    "userId": "X64087180b40X2b909",
>    "username": "admin_9dd810f8901d48778XX",
>    "password": "chX6_",
>    "domainId": "c1ddad17cfcX41",
>    "domainName": "10XX",
>    "role": "admin"
>  }
>
>  def set_hadoop_config(credentials):
>      """This function sets the Hadoop configuration with given credentials,
>      so it is possible to access data using SparkContext"""
>
>      prefix = "fs.swift.service." + credentials['name']
>      hconf = sc._jsc.hadoopConfiguration()
>      hconf.set(prefix + ".auth.url",
>                credentials['auth_url'] + '/v3/auth/tokens')
>      hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
>      hconf.set(prefix + ".tenant", credentials['projectId'])
>      hconf.set(prefix + ".username", credentials['userId'])
>      hconf.set(prefix + ".password", credentials['password'])
>      hconf.setInt(prefix + ".http.port", 8080)
>      hconf.set(prefix + ".region", credentials['region'])
>      hconf.setBoolean(prefix + ".public", True)
>
>  set_hadoop_config(credentials)
>
>  -
>
>  Py4JJavaErrorTraceback (most recent call last)
>   in ()
>  > 1 train.groupby('Acordo').count().show()
>
>  *Py4JJavaError: An error occurred while calling* o406.showString.
>  : org.apache.spark.Spark

Re: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Daniel Lopes
Thanks Steve,

But this error occurs only with parquet files; CSVs work.

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 9 Sep 2016, at 17:56, Daniel Lopes <dan...@onematch.com.br> wrote:
>
> Hi, can someone help?
>
> I'm trying to use parquet with IBM Block Storage in Spark, but when I try
> to load I get this error:
>
> using this config
>
> credentials = {
>   "name": "keystone",
>   "auth_url": "https://identity.open.softlayer.com",
>   "project": "object_storage_23f274c1_d11XXXe634",
>   "projectId": "XXd9c4aa39b7c7eb",
>   "region": "dallas",
>   "userId": "X64087180b40X2b909",
>   "username": "admin_9dd810f8901d48778XX",
>   "password": "chX6_",
>   "domainId": "c1ddad17cfcX41",
>   "domainName": "10XX",
>   "role": "admin"
> }
>
> def set_hadoop_config(credentials):
>     """This function sets the Hadoop configuration with given credentials,
>     so it is possible to access data using SparkContext"""
>
>     prefix = "fs.swift.service." + credentials['name']
>     hconf = sc._jsc.hadoopConfiguration()
>     hconf.set(prefix + ".auth.url",
>               credentials['auth_url'] + '/v3/auth/tokens')
>     hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
>     hconf.set(prefix + ".tenant", credentials['projectId'])
>     hconf.set(prefix + ".username", credentials['userId'])
>     hconf.set(prefix + ".password", credentials['password'])
>     hconf.setInt(prefix + ".http.port", 8080)
>     hconf.set(prefix + ".region", credentials['region'])
>     hconf.setBoolean(prefix + ".public", True)
>
> set_hadoop_config(credentials)
>
> -
>
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 train.groupby('Acordo').count().show()
>
> *Py4JJavaError: An error occurred while calling* o406.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 60 in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in
> stage 30.0 (TID 2556, yp-spark-dal09-env5-0039): org.apache.hadoop.fs.swift.
> exceptions.SwiftConfigurationException:* Missing mandatory configuration
> option: fs.swift.service.keystone.auth.url*
>
>
>
> In my own code, I'd assume that the value of credentials['name'] didn't
> match that of the URL, assuming you have something like
> swift://bucket.keystone . Failing that: the options were set too late.
>
> Instead of asking for the hadoop config and editing that, set the option
> in your spark context, before it is launched, with the prefix "hadoop"
>
>
> at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(
> RestClientBindings.java:223)
> at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(
> RestClientBindings.java:147)
>
>
> *Daniel Lopes*
> Chief Data and Analytics Officer | OneMatch
> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>
> www.onematch.com.br
> <http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
>
>
>
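A minimal sketch of Steve's suggestion above (hedged: it assumes you can build
the SparkContext yourself rather than using the notebook's pre-created one, and
it reuses the credentials dict from the question; Spark copies any
"spark.hadoop.*" property into the Hadoop configuration at launch):

from pyspark import SparkConf, SparkContext

prefix = "spark.hadoop.fs.swift.service." + credentials['name']  # -> ...keystone
conf = (SparkConf()
        .set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
        .set(prefix + ".auth.endpoint.prefix", "endpoints")
        .set(prefix + ".tenant", credentials['projectId'])
        .set(prefix + ".username", credentials['userId'])
        .set(prefix + ".password", credentials['password'])
        .set(prefix + ".http.port", "8080")
        .set(prefix + ".region", credentials['region'])
        .set(prefix + ".public", "true"))
sc = SparkContext(conf=conf)  # the options are in place before any job runs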


Spark + Parquet + IBM Block Storage at Bluemix

2016-09-09 Thread Daniel Lopes
Hi, can someone help?

I'm trying to use parquet with IBM Block Storage in Spark, but when I try to
load I get this error:

using this config

credentials = {
  "name": "keystone",
  "auth_url": "https://identity.open.softlayer.com",
  "project": "object_storage_23f274c1_d11XXXe634",
  "projectId": "XXd9c4aa39b7c7eb",
  "region": "dallas",
  "userId": "X64087180b40X2b909",
  "username": "admin_9dd810f8901d48778XX",
  "password": "chX6_",
  "domainId": "c1ddad17cfcX41",
  "domainName": "10XX",
  "role": "admin"
}

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given credentials,
    so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url",
              credentials['auth_url'] + '/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['projectId'])
    hconf.set(prefix + ".username", credentials['userId'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)
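Presumably the failing read looks roughly like this (a hedged sketch -- the
container name and object path are hypothetical; the part after the dot in the
swift:// URL must match credentials['name'], i.e. "keystone"):

# hypothetical container "notebooks" and path; sqlContext comes from the notebook
train = sqlContext.read.parquet("swift://notebooks.keystone/train.parquet")
train.groupby('Acordo').count().show()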

-

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 train.groupby('Acordo').count().show()

*Py4JJavaError: An error occurred while calling o406.showString.*
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
60 in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in
stage 30.0 (TID 2556, yp-spark-dal09-env5-0039):
org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: *Missing
mandatory configuration option: fs.swift.service.keystone.auth.url*
at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)


*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>


Re: year out of range

2016-09-09 Thread Daniel Lopes
Thanks Ayan!

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Thu, Sep 8, 2016 at 7:54 PM, ayan guha <guha.a...@gmail.com> wrote:

> Another way of debugging would be writing another UDF that returns a
> string. Also, in that function, put something useful in the catch block,
> so you can filter those records from the df.
> On 9 Sep 2016 03:41, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>
>> Thanks Mike,
>>
>> A good way to debug! That was it!
>>
>> Best,
>>
>> *Daniel Lopes*
>> Chief Data and Analytics Officer | OneMatch
>> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>>
>> www.onematch.com.br
>> <http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
>>
>> On Thu, Sep 8, 2016 at 2:26 PM, Mike Metzger <m...@flexiblecreations.com>
>> wrote:
>>
>>> My guess is there's some row that does not match up with the expected
>>> data.  While slower, I've found RDDs to be easier to troubleshoot this kind
>>> of thing until you sort out exactly what's happening.
>>>
>>> Something like:
>>>
>>> raw_data = sc.textFile("")
>>> rowcounts = raw_data.map(lambda x: (len(x.split(",")),
>>> 1)).reduceByKey(lambda x,y: x+y)
>>> rowcounts.take(5)
>>>
>>> badrows = raw_data.filter(lambda x: len(x.split(",")) != <expected number of columns>)
>>> if badrows.count() > 0:
>>> badrows.saveAsTextFile("")
>>>
>>>
>>> You should be able to tell if there are any rows with column counts that
>>> don't match up (the thing that usually bites me with CSV conversions).
>>> Assuming these all match to what you want, I'd try mapping the unparsed
>>> date column out to separate fields and try to see if a year field isn't
>>> matching the expected values.
>>>
>>> Thanks
>>>
>>> Mike
>>>
>>>
>>> On Thu, Sep 8, 2016 at 8:15 AM, Daniel Lopes <dan...@onematch.com.br>
>>> wrote:
>>>
>>>> Thanks,
>>>>
>>>> I *tested* the function offline and it works.
>>>> I also tested with a select * after converting the data, and the new
>>>> data looks good,
>>>> *but* if I *register it as a temp table* to *join another table* it still
>>>> shows *the same error*.
>>>>
>>>> ValueError: year out of range
>>>>
>>>> Best,
>>>>
>>>> *Daniel Lopes*
>>>> Chief Data and Analytics Officer | OneMatch
>>>> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>>>>
>>>> www.onematch.com.br
>>>> <http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
>>>>
>>>> On Thu, Sep 8, 2016 at 9:43 AM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Daniel
>>>>> Test the parse date offline to make sure it returns what you expect.
>>>>> If it does, in the spark shell create a df with 1 row only and run your
>>>>> UDF; you should be able to see the issue.
>>>>> If not, send me a reduced CSV file at my email and I'll give it a try
>>>>> this eve; hopefully someone else will be able to assist in the meantime.
>>>>> You don't need to run a full spark app to debug the issue.
>>>>> Your problem is either in the parse date or in what gets passed to the
>>>>> UDF.
>>>>> Hth
>>>>>
>>>>> On 8 Sep 2016 1:31 pm, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>>>>>
>>>>>> Thanks Marco for your response.
>>>>>>
>>>>>> The field came encoded by SQL Server in locale pt_BR.
>>>>>>
>>>>>> The code that I am formatting with is:
>>>>>>
>>>>>> --
>>>>>> def parse_date(argument, format_date='%Y-%m%d %H:%M:%S'):
>>>>>>     try:
>>>>>>         locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
>>>>>>         return datetime.strptime(argument, format_date)
>>>>>>     except:
>>>>>>         return None
>>>>>&
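Ayan's idea, sketched (hedged -- parse_date_debug and the 'PARSE_ERROR' marker
are made up for illustration; transacoes is the DataFrame from this thread):

import locale
from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def parse_date_debug(argument, format_date='%b %d %Y %H:%M'):
    # Return the parsed date as a string, or a marker plus the raw value.
    try:
        locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
        return str(datetime.strptime(argument, format_date))
    except Exception as e:
        return 'PARSE_ERROR: %r (%s)' % (argument, e)

debug_date = udf(parse_date_debug, StringType())
checked = transacoes.withColumn('tr_Vencimento_dbg',
                                debug_date(transacoes.tr_Vencimento))
checked.filter(checked.tr_Vencimento_dbg.startswith('PARSE_ERROR')).show(20, False)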

Re: year out of range

2016-09-08 Thread Daniel Lopes
Thanks Mike,

A good way to debug! That was it!

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Thu, Sep 8, 2016 at 2:26 PM, Mike Metzger <m...@flexiblecreations.com>
wrote:

> My guess is there's some row that does not match up with the expected
> data.  While slower, I've found RDDs to be easier to troubleshoot this kind
> of thing until you sort out exactly what's happening.
>
> Something like:
>
> raw_data = sc.textFile("")
> rowcounts = raw_data.map(lambda x: (len(x.split(",")),
> 1)).reduceByKey(lambda x,y: x+y)
> rowcounts.take(5)
>
> badrows = raw_data.filter(lambda x: len(x.split(",")) != <expected number of columns>)
> if badrows.count() > 0:
> badrows.saveAsTextFile("")
>
>
> You should be able to tell if there are any rows with column counts that
> don't match up (the thing that usually bites me with CSV conversions).
> Assuming these all match to what you want, I'd try mapping the unparsed
> date column out to separate fields and try to see if a year field isn't
> matching the expected values.
>
> Thanks
>
> Mike
>
>
> On Thu, Sep 8, 2016 at 8:15 AM, Daniel Lopes <dan...@onematch.com.br>
> wrote:
>
>> Thanks,
>>
>> I *tested* the function offline and it works.
>> I also tested with a select * after converting the data, and the new data
>> looks good,
>> *but* if I *register it as a temp table* to *join another table* it still
>> shows *the same error*.
>>
>> ValueError: year out of range
>>
>> Best,
>>
>> *Daniel Lopes*
>> Chief Data and Analytics Officer | OneMatch
>> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>>
>> www.onematch.com.br
>> <http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
>>
>> On Thu, Sep 8, 2016 at 9:43 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Daniel
>>> Test the parse date offline to make sure it returns what you expect.
>>> If it does, in the spark shell create a df with 1 row only and run your
>>> UDF; you should be able to see the issue.
>>> If not, send me a reduced CSV file at my email and I'll give it a try this
>>> eve; hopefully someone else will be able to assist in the meantime.
>>> You don't need to run a full spark app to debug the issue.
>>> Your problem is either in the parse date or in what gets passed to the UDF.
>>> Hth
>>>
>>> On 8 Sep 2016 1:31 pm, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>>>
>>>> Thanks Marco for your response.
>>>>
>>>> The field came encoded by SQL Server in locale pt_BR.
>>>>
>>>> The code that I am formatting with is:
>>>>
>>>> --
>>>> def parse_date(argument, format_date='%Y-%m%d %H:%M:%S'):
>>>>     try:
>>>>         locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
>>>>         return datetime.strptime(argument, format_date)
>>>>     except:
>>>>         return None
>>>>
>>>> convert_date = funcspk.udf(lambda x: parse_date(x, '%b %d %Y %H:%M'),
>>>> TimestampType())
>>>>
>>>> transacoes = transacoes.withColumn('tr_Vencimento',
>>>> convert_date(transacoes.tr_Vencimento))
>>>>
>>>> --
>>>>
>>>> the sample is
>>>>
>>>> -
>>>> [wide sample table; the +---+ show() output wrapped badly in the archive.
>>>> The columns run from tr_NumeroContrato, tr_TipoDocumento, *tr_Vencimento*
>>>> and tr_Valor through tr_DataImportacao and tr_Agencia; the full rows
>>>> appear in the 2016-09-08 messages below.]

Re: year out of range

2016-09-08 Thread Daniel Lopes
Thanks,

I *tested* the function offline and it works.
I also tested with a select * after converting the data, and the new data
looks good,
*but* if I *register it as a temp table* to *join another table* it still
shows *the same error*.

ValueError: year out of range

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Thu, Sep 8, 2016 at 9:43 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Daniel
> Test the parse date offline to make sure it returns what you expect.
> If it does, in the spark shell create a df with 1 row only and run your UDF;
> you should be able to see the issue.
> If not, send me a reduced CSV file at my email and I'll give it a try this eve;
> hopefully someone else will be able to assist in the meantime.
> You don't need to run a full spark app to debug the issue.
> Your problem is either in the parse date or in what gets passed to the UDF.
> Hth
>
> On 8 Sep 2016 1:31 pm, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>
>> Thanks Marco for your response.
>>
>> The field came encoded by SQL Server in locale pt_BR.
>>
>> The code that I am formatting with is:
>>
>> --
>> def parse_date(argument, format_date='%Y-%m%d %H:%M:%S'):
>>     try:
>>         locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
>>         return datetime.strptime(argument, format_date)
>>     except:
>>         return None
>>
>> convert_date = funcspk.udf(lambda x: parse_date(x, '%b %d %Y %H:%M'),
>> TimestampType())
>>
>> transacoes = transacoes.withColumn('tr_Vencimento',
>> convert_date(transacoes.tr_Vencimento))
>>
>> --
>>
>> the sample is
>>
>> -
>> [The sample show() output wrapped badly in the archive. The columns are
>> tr_NumeroContrato, tr_TipoDocumento, *tr_Vencimento*, tr_Valor,
>> tr_DataRecebimento, tr_TaxaMora, tr_DescontoMaximo, tr_DescontoMaximoCorr,
>> tr_ValorAtualizado, tr_ComGarantia, tr_ValorDesconto, tr_ValorJuros,
>> tr_ValorMulta, tr_DataDevolucaoCheque, tr_ValorCorrigidoContratante,
>> tr_DataNotificacao, tr_Banco, tr_Praca, tr_DescricaoAlinea,
>> tr_Enquadramento, tr_Linha, tr_Arquivo, tr_DataImportacao, tr_Agencia.
>> All visible rows have tr_NumeroContrato 992600153001 and tr_Valor 254.35,
>> with mostly null values elsewhere; the *tr_Vencimento* values shown here
>> are 'Jul 20 2015 12:00', 'Abr 20 2015 12:00', 'Nov 20 2015 12:00' and
>> 'Dez 20 2015 12:00'.]

Re: year out of range

2016-09-08 Thread Daniel Lopes
[continuation of the wrapped sample table from the previous archive entry;
the visible *tr_Vencimento* values here are 'Set 20 2015 12:00', 'Mai 20 2015
12:00', 'Out 20 2015 12:00' and 'Mar 20 2015 12:00', again for contract
992600153001 with tr_Valor 254.35 and mostly null values in the other columns.]

-

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Thu, Sep 8, 2016 at 5:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Please paste the code and a sample CSV.
> I'm guessing it has to do with formatting time?
> Kr
>
> On 8 Sep 2016 12:38 am, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>
>> Hi,
>>
>> I'm *importing a few CSVs* with the spark-csv package.
>> Whenever I do a select on each one it looks ok,
>> but when I join them with sqlContext.sql it gives me this error.
>>
>> All tables have timestamp fields;
>>
>> the joins are not on these dates.
>>
>>
>> *Py4JJavaError: An error occurred while calling o643.showString.*
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 54 in stage 92.0 failed 10 times, most recent failure: Lost task 54.9
>> in stage 92.0 (TID 6356, yp-spark-dal09-env5-0036):
>> org.apache.spark.api.python.PythonException: Traceback (most recent call
>> last):
>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/
>> lib/pyspark.zip/pyspark/worker.py", line 111, in main
>> process()
>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/
>> lib/pyspark.zip/pyspark/worker.py", line 106, in process
>> serializer.dump_stream(func(split_index, iterator), outfile)
>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/
>> lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
>> vs = list(itertools.islice(iterator, batch))
>>   File "/usr/local/src/spark160master/spark/python/pyspark/sql/functions.py",
>> line 1563, in 
>> func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/
>> lib/pyspark.zip/pyspark/sql/types.py", line 191, in toInternal
>> else time.mktime(dt.timetuple()))
>> *ValueError: year out of range  *
>>
>> Does anyone know about this problem?
>>
>> Best,
>>
>> *Daniel Lopes*
>> Chief Data and Analytics Officer | OneMatch
>> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>>
>> www.onematch.com.br
>> <http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
>>
>
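Marco's one-row test, sketched (hedged -- the literal row mirrors the sample
above, and convert_date is the UDF defined in this thread):

# build a single-row DataFrame in the shell and run the UDF on it in isolation
one_row = sqlContext.createDataFrame(
    [('992600153001', 'Abr 20 2015 12:00')],
    ['tr_NumeroContrato', 'tr_Vencimento'])
one_row.select(convert_date(one_row.tr_Vencimento)).show(truncate=False)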


year out of range

2016-09-07 Thread Daniel Lopes
Hi,

I'm *importing a few CSVs* with the spark-csv package.
Whenever I do a select on each one it looks ok,
but when I join them with sqlContext.sql it gives me this error.

All tables have timestamp fields;

the joins are not on these dates.


*Py4JJavaError: An error occurred while calling o643.showString.*
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
54 in stage 92.0 failed 10 times, most recent failure: Lost task 54.9 in
stage 92.0 (TID 6356, yp-spark-dal09-env5-0036):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
"/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/worker.py",
line 111, in main
process()
  File
"/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/worker.py",
line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File
"/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/serializers.py",
line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File
"/usr/local/src/spark160master/spark/python/pyspark/sql/functions.py", line
1563, in 
func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
  File
"/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/sql/types.py",
line 191, in toInternal
else time.mktime(dt.timetuple()))
*ValueError: year out of range  *

Does anyone know about this problem?

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>
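The traceback ends in TimestampType.toInternal, which calls
time.mktime(dt.timetuple()). A minimal way to see the same exception outside
Spark (hedged -- behaviour of Python 2.7, which the Spark 1.6 workers here run;
the year 215 is just an example of a badly parsed year):

import time
from datetime import datetime

dt = datetime(215, 7, 20, 12, 0)   # a datetime whose year mktime cannot handle
time.mktime(dt.timetuple())        # ValueError: year out of range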


Re: unsubscribe

2016-08-03 Thread Daniel Lopes
please send to user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?utm_source=EmailSignature_term=daniel-lopes>

On Tue, Aug 2, 2016 at 10:11 AM, <doovs...@sina.com> wrote:

> unsubscribe
>
>
> 
>
> ZhangYi (张逸)
>
> BigEye
>
> website: http://www.bigeyedata.com
>
> blog: http://zhangyi.farbox.com
>
> tel: 15023157626
>
>
>
>
> - Original Message -
> From: "zhangjp" <592426...@qq.com>
> To: "user" <user@spark.apache.org>
> Subject: unsubscribe
> Date: 2016-08-02 11:00
>
> unsubscribe
>


Check out Kyper! Trying to be Uber of Data

2016-07-25 Thread Daniel Lopes
I just signed up for Kyper and thought you might be interested, too!

http://l.aunch.us/L7Ezb


Re: unsubscribe)

2016-07-25 Thread Daniel Lopes
Hi Uzi,

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
<http://www.onematch.com.br/?pk_campaign=EmailSignature_kwd=daniel-lopes>

On Mon, Jul 25, 2016 at 2:36 AM, Uzi Hadad <uziha...@mta.ac.il> wrote:

>
>


Re: Scala VS Java VS Python

2015-12-16 Thread Daniel Lopes
For me Scala is better since Spark is written in Scala, and I like Python
because I have always used Python for data science. :)

On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia <h...@danielvaldivia.com>
wrote:

> Hello,
>
> This is more of a "survey" question for the community; you can reply to me
> directly so we don't flood the mailing list.
>
> I'm having a hard time learning Spark using Python since the API seems to
> be slightly incomplete, so I'm looking at my options to start doing all my
> apps in either Scala or Java. Being a Java developer, Java 1.8 looks like
> the logical way, but I'd like to ask here what's the most common (Scala or
> Java), since I'm observing mixed results in the social documentation;
> Scala seems to be the predominant language for Spark examples.
>
> Thanks for the advice
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
<http://edital.confea.org.br/ConsultaProfissional/cartao.aspx?rnp=2613651334>
Mob +55 (18) 99764-2733 <callto:+5518997642733>
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br


Spark 1.5.2 + Hive 1.0.0 in Amazon EMR 4.2.0

2015-11-30 Thread Daniel Lopes
Hi,

I get this error when trying to write a Spark DataFrame to a Hive table
stored as TextFile:


sqlContext.sql('INSERT OVERWRITE TABLE analytics.client_view_stock *(the Hive
table)* SELECT * FROM client_view_stock *(the Spark temp table)*')
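The surrounding flow is presumably along these lines (a hedged sketch assuming
a HiveContext; the source path is a made-up placeholder):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.parquet('s3://my-bucket/input/stock/')   # hypothetical source
df.registerTempTable('client_view_stock')                     # the Spark temp table
sqlContext.sql('INSERT OVERWRITE TABLE analytics.client_view_stock '
               'SELECT * FROM client_view_stock')             # into the Hive table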

Error:

15/11/30 21:40:14 INFO latency: StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 5ADBECA2D82A7C17), S3 Extended Request ID:
RcPfjgWaeXG62xyVRrAr91sVQNxktqbXUPJgK2cvZlf6SKEAOnWCtV9X9K1Vp9dAyDhGALQRBcU=],
ServiceName=*[Amazon S3], AWSErrorCode=[404 Not Found]*,
AWSRequestID=[5ADBECA2D82A7C17], ServiceEndpoint=[
https://my-bucket.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[214.69],
HttpRequestTime=[214.245], HttpClientReceiveResponseTime=[212.513],
RequestSigningTime=[0.16], HttpClientSendRequestTime=[0.112],
15/11/30 21:40:21 INFO Hive: Replacing
src:s3://my-bucket/output/2015/11/29/client_view_stock/.hive-staging_hive_2015-11-30_21-19-48_942_238078420083598647-1/-ext-1/part-00199,
dest: s3://my-bucket/output/2015/11/29/client_view_stock/part-00199,
Status:true
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
15/11/30 21:40:21 INFO latency: StatusCode=[200], ServiceName=[Amazon S3],
AWSRequestID=[2509AE55A8D71A61], ServiceEndpoint=[https://my-bucket.
s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1,
HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1,
ClientExecuteTime=[137.387], HttpRequestTime=[136.721],
HttpClientReceiveResponseTime=[134.805], RequestSigningTime=[0.235],
ResponseProcessingTime=[0.169], HttpClientSendRequestTime=[0.145],
15/11/30 21:40:21 WARN RetryingMetaStoreClient: MetaStoreClient lost
connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: *Invalid method name:
'alter_table_with_cascade'*

Thanks!

-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
<http://edital.confea.org.br/ConsultaProfissional/cartao.aspx?rnp=2613651334>
Mob +55 (18) 99764-2733 <callto:+5518997642733>
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br


Re: UDF with 2 arguments

2015-11-26 Thread Daniel Lopes
Thanks Davies and Nathan,

I found my error.

I was using *ArrayType()*, but I needed to pass the element type the array
holds, and I was not passing it: *ArrayType(IntegerType())*.

Thanks :)
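In other words (a hedged sketch of the fix; the two-argument body is made up,
and a bare ArrayType() is what raises the "__init__() takes at least 2
arguments" error quoted below):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def function(arg1, arg2):
    return [arg1, arg2]   # hypothetical body returning a list of ints

udf_function = udf(function, ArrayType(IntegerType()))   # not bare ArrayType()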

On Wed, Nov 25, 2015 at 7:46 PM, Davies Liu <dav...@databricks.com> wrote:

> It works in master (1.6), what's the version of Spark you have?
>
> >>> from pyspark.sql.functions import udf
> >>> def f(a, b): pass
> ...
> >>> my_udf = udf(f)
> >>> from pyspark.sql.types import *
> >>> my_udf = udf(f, IntegerType())
>
>
> On Wed, Nov 25, 2015 at 12:01 PM, Daniel Lopes <dan...@bankfacil.com.br>
> wrote:
> > Hallo,
> >
> > suppose I have a function in pyspark like
> >
> > def function(arg1,arg2):
> >   pass
> >
> > and
> >
> > udf_function = udf(function, IntegerType())
> >
> > that gives me this error
> >
> > Traceback (most recent call last):
> >   File "", line 1, in 
> > TypeError: __init__() takes at least 2 arguments (1 given)
> >
> >
> > How I use?
> >
> > Best,
> >
> >
> > --
> > Daniel Lopes, B.Eng
> > Data Scientist - BankFacil
> > CREA/SP 5069410560
> > Mob +55 (18) 99764-2733
> > Ph +55 (11) 3522-8009
> > http://about.me/dannyeuu
> >
> > Av. Nova Independência, 956, São Paulo, SP
> > Bairro Brooklin Paulista
> > CEP 04570-001
> > https://www.bankfacil.com.br
> >
>



-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
<http://edital.confea.org.br/ConsultaProfissional/cartao.aspx?rnp=2613651334>
Mob +55 (18) 99764-2733 <callto:+5518997642733>
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br


spark-csv on Amazon EMR

2015-11-23 Thread Daniel Lopes
Hi,

Does anyone know how to use spark-csv in the create-cluster statement of the
Amazon EMR CLI?

Best,

-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
<http://edital.confea.org.br/ConsultaProfissional/cartao.aspx?rnp=2613651334>
Mob +55 (18) 99764-2733 <callto:+5518997642733>
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br