Re: unsubscribe
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | http://www.daniellopes.com.br
www.onematch.com.br

On Mon, Sep 26, 2016 at 12:24 PM, Karthikeyan Vasuki Balasubramaniam <kvasu...@eng.ucsd.edu> wrote:
> unsubscribe
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
Re: unsubscribe
Hi Chang, just send an e-mail to user-unsubscr...@spark.apache.org

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Tue, Sep 13, 2016 at 12:38 AM, ChangMingMin(常明敏) <chang_ming...@founder.com> wrote:
> unsubscribe
Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix
Hi Mario,

Thanks for your help, so I will keep using CSVs.

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Mon, Sep 12, 2016 at 3:39 PM, Mario Ds Briggs <mario.bri...@in.ibm.com> wrote:
> Daniel,
>
> I believe it is related to https://issues.apache.org/jira/browse/SPARK-13979
> and happens only when a task fails in an executor (you probably hit the
> failure in Parquet and not CSV for some other reason).
>
> The PR in there should shortly be available in IBM's Analytics for Spark.
>
> thanks
> Mario
>
> From: Adam Roberts/UK/IBM
> To: Mario Ds Briggs/India/IBM@IBMIN
> Date: 12/09/2016 09:37 pm
> Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix
>
> Mario, in case you've not seen this...
>
> *Adam Roberts*
> IBM Spark Team Lead
> Runtime Technologies - Hursley
> ----- Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -----
Re: Spark + Parquet + IBM Block Storage at Bluemix
Thanks Steve,

But this error occurs only with parquet files; CSVs work.

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 9 Sep 2016, at 17:56, Daniel Lopes <dan...@onematch.com.br> wrote:
>
> Hi, can someone help?
>
> I'm trying to use Parquet with IBM Block Storage on Spark, but when I try
> to load the data I get this error, using this config:
>
> credentials = {
>     "name": "keystone",
>     "auth_url": "https://identity.open.softlayer.com",
>     "project": "object_storage_23f274c1_d11XXXe634",
>     "projectId": "XXd9c4aa39b7c7eb",
>     "region": "dallas",
>     "userId": "X64087180b40X2b909",
>     "username": "admin_9dd810f8901d48778XX",
>     "password": "chX6_",
>     "domainId": "c1ddad17cfcX41",
>     "domainName": "10XX",
>     "role": "admin"
> }
>
> def set_hadoop_config(credentials):
>     """This function sets the Hadoop configuration with given credentials,
>     so it is possible to access data using SparkContext"""
>
>     prefix = "fs.swift.service." + credentials['name']
>     hconf = sc._jsc.hadoopConfiguration()
>     hconf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
>     hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
>     hconf.set(prefix + ".tenant", credentials['projectId'])
>     hconf.set(prefix + ".username", credentials['userId'])
>     hconf.set(prefix + ".password", credentials['password'])
>     hconf.setInt(prefix + ".http.port", 8080)
>     hconf.set(prefix + ".region", credentials['region'])
>     hconf.setBoolean(prefix + ".public", True)
>
> set_hadoop_config(credentials)
>
> -
>
> Py4JJavaError Traceback (most recent call last)
> ----> 1 train.groupby('Acordo').count().show()
>
> Py4JJavaError: An error occurred while calling o406.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 60 in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in
> stage 30.0 (TID 2556, yp-spark-dal09-env5-0039):
> org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing
> mandatory configuration option: fs.swift.service.keystone.auth.url
>
>
> In my own code, I'd assume that the value of credentials['name'] didn't
> match that of the URL, assuming you have something like
> swift://bucket.keystone . Failing that: the options were set too late.
>
> Instead of asking for the hadoop config and editing that, set the option
> in your spark context, before it is launched, with the prefix "hadoop"
>
> at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
> at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)
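A minimal sketch of Steve's suggestion, assuming a PySpark 1.x-style
SparkConf: options set with the "spark.hadoop." prefix before the
SparkContext is launched are copied into the Hadoop configuration, so the
Swift settings reach every executor. The key names follow the snippet above;
the values are placeholders, not working credentials.

    from pyspark import SparkConf, SparkContext

    # Swift/Keystone options, set on the SparkConf *before* the context starts.
    # "spark.hadoop." + <hadoop key> lands in the Hadoop configuration.
    prefix = "spark.hadoop.fs.swift.service.keystone"
    conf = (SparkConf()
            .set(prefix + ".auth.url", "https://identity.open.softlayer.com/v3/auth/tokens")
            .set(prefix + ".auth.endpoint.prefix", "endpoints")
            .set(prefix + ".tenant", "<projectId>")
            .set(prefix + ".username", "<userId>")
            .set(prefix + ".password", "<password>")
            .set(prefix + ".http.port", "8080")
            .set(prefix + ".region", "dallas")
            .set(prefix + ".public", "true"))
    sc = SparkContext(conf=conf)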
Spark + Parquet + IBM Block Storage at Bluemix
Hi, can someone help?

I'm trying to use Parquet with IBM Block Storage on Spark, but when I try to
load the data I get the error below, using this config:

credentials = {
    "name": "keystone",
    "auth_url": "https://identity.open.softlayer.com",
    "project": "object_storage_23f274c1_d11XXXe634",
    "projectId": "XXd9c4aa39b7c7eb",
    "region": "dallas",
    "userId": "X64087180b40X2b909",
    "username": "admin_9dd810f8901d48778XX",
    "password": "chX6_",
    "domainId": "c1ddad17cfcX41",
    "domainName": "10XX",
    "role": "admin"
}

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given credentials,
    so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['projectId'])
    hconf.set(prefix + ".username", credentials['userId'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)

-

Py4JJavaError Traceback (most recent call last)
----> 1 train.groupby('Acordo').count().show()

Py4JJavaError: An error occurred while calling o406.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 60
in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in stage
30.0 (TID 2556, yp-spark-dal09-env5-0039):
org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing
mandatory configuration option: fs.swift.service.keystone.auth.url

    at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
    at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br
Re: year out of range
Thanks Ayan!

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Thu, Sep 8, 2016 at 7:54 PM, ayan guha <guha.a...@gmail.com> wrote:
> Another way of debugging would be writing another UDF that returns a
> string. Also, in that function, put something useful in the catch block,
> so you can filter those records from the df.
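A minimal sketch of ayan's idea, assuming the column name and date format
from the thread below: a second UDF that returns a string instead of a
timestamp and tags unparseable values, so the offending rows can be filtered
out and inspected. The 'BAD:' tag is illustrative.

    from datetime import datetime
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def check_date(argument, format_date='%b %d %Y %H:%M'):
        # Return 'OK' when the value parses; otherwise tag it with the
        # raw value and the exception so it can be filtered from the df.
        try:
            datetime.strptime(argument, format_date)
            return 'OK'
        except Exception as e:
            return 'BAD: %r (%s)' % (argument, e)

    check_date_udf = udf(check_date, StringType())
    bad = (transacoes
           .withColumn('tr_Vencimento_check', check_date_udf(transacoes.tr_Vencimento))
           .filter("tr_Vencimento_check <> 'OK'"))
    bad.show(truncate=False)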
Re: year out of range
Thanks Mike,

A good way to debug! That was it!

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Thu, Sep 8, 2016 at 2:26 PM, Mike Metzger <m...@flexiblecreations.com> wrote:
> My guess is there's some row that does not match up with the expected
> data. While slower, I've found RDDs to be easier to troubleshoot this kind
> of thing with until you sort out exactly what's happening.
>
> Something like:
>
> raw_data = sc.textFile("<filename>")
> rowcounts = raw_data.map(lambda x: (len(x.split(",")), 1)).reduceByKey(lambda x, y: x + y)
> rowcounts.take(5)
>
> badrows = raw_data.filter(lambda x: len(x.split(",")) != <expected number of columns>)
> if badrows.count() > 0:
>     badrows.saveAsTextFile("<filename>")
>
> You should be able to tell if there are any rows with column counts that
> don't match up (the thing that usually bites me with CSV conversions).
> Assuming these all match what you want, I'd try mapping the unparsed
> date column out to separate fields and see if a year field isn't
> matching the expected values.
>
> Thanks
>
> Mike
Re: year out of range
Thanks,

I *tested* the function offline and it works.
I also tested with a select * from after converting the data, and the new
data looks good,
*but* if I *register it as a temp table* to *join another table* it still
shows *the same error*:

ValueError: year out of range

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Thu, Sep 8, 2016 at 9:43 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
> Daniel
> Test the parse date offline to make sure it returns what you expect.
> If it does, in spark shell create a df with 1 row only and run your UDF;
> you should be able to see the issue.
> If not, send me a reduced CSV file at my email and I'll give it a try this
> eve; hopefully someone else will be able to assist in the meantime.
> You don't need to run a full Spark app to debug the issue.
> Your problem is either in the parse date or in what gets passed to the UDF.
> Hth
>
> On 8 Sep 2016 1:31 pm, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>
>> Thanks Marco for your response.
>>
>> The field came encoded by SQL Server in locale pt_BR.
>>
>> The code that I am formatting is:
>>
>> --
>> def parse_date(argument, format_date='%Y-%m%d %H:%M:%S'):
>>     try:
>>         locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
>>         return datetime.strptime(argument, format_date)
>>     except:
>>         return None
>>
>> convert_date = funcspk.udf(lambda x: parse_date(x, '%b %d %Y %H:%M'),
>>                            TimestampType())
>>
>> transacoes = transacoes.withColumn('tr_Vencimento',
>>                                    convert_date(transacoes.tr_Vencimento))
>> --
>>
>> the sample is (extra columns elided):
>>
>> +-----------------+----------------+-----------------+--------+---+--------------------+---+
>> |tr_NumeroContrato|tr_TipoDocumento|    tr_Vencimento|tr_Valor|...|  tr_DataNotificacao|...|
>> +-----------------+----------------+-----------------+--------+---+--------------------+---+
>> |     992600153001|                |Jul 20 2015 12:00|  254.35|...|2015-07-20 12:00:...|...|
>> |     992600153001|                |Abr 20 2015 12:00|  254.35|...|                null|...|
>> |     992600153001|                |Nov 20 2015 12:00|  254.35|...|2015-11-20 12:00:...|...|
>> |     992600153001|                |Dez 20 2015 12:00|  254.35|...|                null|...|
Re: year out of range
|     992600153001|                |Set 20 2015 12:00|  254.35|...|                null|...|
|     992600153001|                |Mai 20 2015 12:00|  254.35|...|                null|...|
|     992600153001|                |Out 20 2015 12:00|  254.35|...|                null|...|
|     992600153001|                |Mar 20 2015 12:00|  254.35|...|2015-03-20 12:00:...|...|
+-----------------+----------------+-----------------+--------+---+--------------------+---+

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Thu, Sep 8, 2016 at 5:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
> Pls paste code and sample CSV
> I'm guessing it has to do with formatting time?
> Kr
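A plausible locale-independent rework of the parsing above, under the
assumption that locale.setlocale can fail or be a no-op on executors where
the pt_BR.utf8 locale isn't installed, which would leave abbreviations like
"Abr" and "Set" unparsed. It maps the Portuguese month abbreviations by hand
instead:

    from datetime import datetime
    from pyspark.sql.functions import udf
    from pyspark.sql.types import TimestampType

    # pt_BR month abbreviations -> month numbers, inferred from the sample rows.
    PT_MONTHS = {'Jan': 1, 'Fev': 2, 'Mar': 3, 'Abr': 4, 'Mai': 5, 'Jun': 6,
                 'Jul': 7, 'Ago': 8, 'Set': 9, 'Out': 10, 'Nov': 11, 'Dez': 12}

    def parse_date_pt(argument):
        # Expects strings like 'Abr 20 2015 12:00'; returns None on bad input.
        try:
            mon, rest = argument.split(' ', 1)
            return datetime.strptime('%02d %s' % (PT_MONTHS[mon], rest),
                                     '%m %d %Y %H:%M')
        except (AttributeError, KeyError, ValueError):
            return None

    convert_date = udf(parse_date_pt, TimestampType())
    transacoes = transacoes.withColumn('tr_Vencimento',
                                       convert_date(transacoes.tr_Vencimento))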
year out of range
Hi,

I'm *importing a few CSVs* with the spark-csv package. Whenever I run a
select on each one, it looks OK, but when I join them with sqlContext.sql I
get this error.

All tables have timestamp fields; the joins are not on these dates.

Py4JJavaError: An error occurred while calling o643.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 54
in stage 92.0 failed 10 times, most recent failure: Lost task 54.9 in stage
92.0 (TID 6356, yp-spark-dal09-env5-0036):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/src/spark160master/spark/python/pyspark/sql/functions.py", line 1563, in <lambda>
    func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
  File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/sql/types.py", line 191, in toInternal
    else time.mktime(dt.timetuple()))
ValueError: year out of range

Anyone know this problem?

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br
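For context on the traceback: TimestampType.toInternal calls
time.mktime(dt.timetuple()), and time.mktime only accepts dates within the
platform's supported range, so a misparsed datetime (for example a two-digit
year taken literally) raises exactly this error. A small illustration; the
year value is hypothetical, not from the data above:

    import time
    from datetime import datetime

    # A datetime far outside the platform's time_t range makes mktime raise,
    # just like TimestampType.toInternal does in the traceback.
    dt = datetime(15, 7, 20, 12, 0)   # '15' taken literally as the year
    try:
        time.mktime(dt.timetuple())
    except (ValueError, OverflowError) as e:
        print(e)                      # e.g. 'year out of range'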
Re: unsubscribe
Please send to user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Tue, Aug 2, 2016 at 10:11 AM, <doovs...@sina.com> wrote:
> unsubscribe
>
> ZhangYi (张逸)
> BigEye
> website: http://www.bigeyedata.com
> blog: http://zhangyi.farbox.com
> tel: 15023157626
>
> ----- Original Message -----
> From: "zhangjp" <592426...@qq.com>
> To: "user" <user@spark.apache.org>
> Subject: unsubscribe
> Date: 2 Aug 2016, 11:00
>
> unsubscribe
Check out Kyper! Trying to be Uber of Data
I just signed up for Kyper and thought you might be interested, too! http://l.aunch.us/L7Ezb
Re: unsubscribe)
Hi Uzi,

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
www.onematch.com.br

On Mon, Jul 25, 2016 at 2:36 AM, Uzi Hadad <uziha...@mta.ac.il> wrote:
>
Re: Scala VS Java VS Python
For me Scala is better, since Spark is written in Scala, and I like Python
because I have always used Python for data science. :)

On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia <h...@danielvaldivia.com> wrote:
> Hello,
>
> This is more of a "survey" question for the community; you can reply to me
> directly so we don't flood the mailing list.
>
> I'm having a hard time learning Spark using Python, since the API seems to
> be slightly incomplete, so I'm looking at my options to start doing all my
> apps in either Scala or Java. Being a Java developer, Java 1.8 looks like
> the logical way; however, I'd like to ask here what's the most common
> (Scala or Java), since I'm observing mixed results in the social
> documentation, though Scala seems to be the predominant language for Spark
> examples.
>
> Thanks for the advice
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
Mob +55 (18) 99764-2733
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br
Spark 1.5.2 + Hive 1.0.0 in Amazon EMR 4.2.0
Hi,

I get this error when trying to write a Spark DataFrame to a Hive table
stored as TextFile:

sqlContext.sql('INSERT OVERWRITE TABLE analytics.client_view_stock SELECT * FROM client_view_stock')

(analytics.client_view_stock is the Hive table; client_view_stock is the
Spark temp table.)

Error:

15/11/30 21:40:14 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 5ADBECA2D82A7C17), S3 Extended Request ID: RcPfjgWaeXG62xyVRrAr91sVQNxktqbXUPJgK2cvZlf6SKEAOnWCtV9X9K1Vp9dAyDhGALQRBcU=], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[5ADBECA2D82A7C17], ServiceEndpoint=[https://my-bucket.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[214.69], HttpRequestTime=[214.245], HttpClientReceiveResponseTime=[212.513], RequestSigningTime=[0.16], HttpClientSendRequestTime=[0.112],
15/11/30 21:40:21 INFO Hive: Replacing src:s3://my-bucket/output/2015/11/29/client_view_stock/.hive-staging_hive_2015-11-30_21-19-48_942_238078420083598647-1/-ext-1/part-00199, dest: s3://my-bucket/output/2015/11/29/client_view_stock/part-00199, Status:true
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
15/11/30 21:40:21 INFO latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[2509AE55A8D71A61], ServiceEndpoint=[https://my-bucket.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[137.387], HttpRequestTime=[136.721], HttpClientReceiveResponseTime=[134.805], RequestSigningTime=[0.235], ResponseProcessingTime=[0.169], HttpClientSendRequestTime=[0.145],
15/11/30 21:40:21 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 'alter_table_with_cascade'

Thanks!

--
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
Mob +55 (18) 99764-2733
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br
Re: UDF with 2 arguments
Thanks Davies and Nathan,

I found my error. I was using *ArrayType()*, but ArrayType needs the element
type of the array, and I was not passing *ArrayType(IntegerType())*.

Thanks :)

On Wed, Nov 25, 2015 at 7:46 PM, Davies Liu <dav...@databricks.com> wrote:
> It works in master (1.6), what's the version of Spark you have?
>
> >>> from pyspark.sql.functions import udf
> >>> def f(a, b): pass
> ...
> >>> my_udf = udf(f)
> >>> from pyspark.sql.types import *
> >>> my_udf = udf(f, IntegerType())
>
> On Wed, Nov 25, 2015 at 12:01 PM, Daniel Lopes <dan...@bankfacil.com.br> wrote:
> > Hello,
> >
> > Suppose I have a function in PySpark:
> >
> > def function(arg1, arg2):
> >     pass
> >
> > and
> >
> > udf_function = udf(function, IntegerType())
> >
> > which gives me this error:
> >
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > TypeError: __init__() takes at least 2 arguments (1 given)
> >
> > How do I use it?
> >
> > Best,
> >
> > --
> > Daniel Lopes, B.Eng
> > Data Scientist - BankFacil
> > CREA/SP 5069410560
> > Mob +55 (18) 99764-2733
> > Ph +55 (11) 3522-8009
> > http://about.me/dannyeuu
> >
> > Av. Nova Independência, 956, São Paulo, SP
> > Bairro Brooklin Paulista
> > CEP 04570-001
> > https://www.bankfacil.com.br

--
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
Mob +55 (18) 99764-2733
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br
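A short sketch of the fix described above; the function body is illustrative,
not from the thread. ArrayType cannot be instantiated bare: it requires the
element type of the array it describes.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    def function(arg1, arg2):
        # Illustrative two-argument body returning a list of ints.
        return [int(arg1), int(arg2)]

    # ArrayType() alone raises TypeError; the element type is mandatory.
    udf_function = udf(function, ArrayType(IntegerType()))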
spark-csv on Amazon EMR
Hi,

Does anyone know how to use spark-csv in the create-cluster statement of the
Amazon EMR CLI?

Best,

--
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560
Mob +55 (18) 99764-2733
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br
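One common approach, noted here as an assumption rather than an answer from
this thread: spark-csv is usually pulled from Maven at job-submission time
with --packages instead of at cluster creation. The version below is
illustrative; match it to your cluster's Scala build, and my_job.py is a
placeholder.

    # On the EMR master node, or in an EMR step that invokes spark-submit:
    spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 my_job.py

    # The same flag works for the interactive shells:
    pyspark --packages com.databricks:spark-csv_2.10:1.3.0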