Re: Help With unstructured text file with spark scala

2022-02-13 Thread Rafael Mendes
Hi, Danilo.
Do you have only a single large file?
If so, I guess you can use tools like sed/awk to split it into several files,
one per layout, and then read each of those files into Spark.
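
If pre-splitting with sed/awk is not convenient, the two layouts could also be separated in plain Scala by carrying the header values (Operadora, Filial, Contrato, ...) forward while scanning the file top to bottom. This is only a minimal sketch, assuming the layout of the sample quoted further down in this thread; `Beneficiario`, `LayoutParser` and `parse` are hypothetical names that do not appear in the thread.

```scala
// Hypothetical record type; fields follow the columns requested in the thread.
final case class Beneficiario(
  operadora: String, filial: String, unidade: String,
  contrato: String, empresa: String,
  plano: String, codigoBeneficiario: String, nomeBeneficiario: String)

object LayoutParser {
  // Header values seen so far; they apply to every beneficiary row that follows.
  private final case class Ctx(
    operadora: String = "", filial: String = "", unidade: String = "",
    contrato: String = "", empresa: String = "")

  def parse(lines: Iterator[String]): Vector[Beneficiario] = {
    val (_, rows) = lines.map(_.trim).filter(_.nonEmpty)
      .foldLeft((Ctx(), Vector.empty[Beneficiario])) { case ((ctx, acc), line) =>
        // limit -1 keeps empty fields, so "a##b" yields three separators' worth of fields
        line.split("#", -1) match {
          case Array("Operadora", v)            => (ctx.copy(operadora = v), acc)
          case Array("Filial", f, "Unidade", u) => (ctx.copy(filial = f, unidade = u), acc)
          case Array("Contrato", v)             => (ctx.copy(contrato = v), acc)
          case Array("Empresa", v)              => (ctx.copy(empresa = v), acc)
          case Array("Plano", _, _)             => (ctx, acc) // column header of the table block
          case Array(plano, codigo, nome) =>
            val row = Beneficiario(ctx.operadora, ctx.filial, ctx.unidade,
              ctx.contrato, ctx.empresa, plano, codigo, nome)
            (ctx, acc :+ row)
          case _ => (ctx, acc) // report title, "Carteira em" line, footers
        }
      }
    rows
  }
}
```

With a SparkSession in scope (assumed here to be named `spark`), `import spark.implicits._` followed by `LayoutParser.parse(Source.fromFile("1.txt").getLines()).toDF()` should give a DataFrame with one column per case-class field. That step is not exercised by the sketch, and for a very large file the whole result is held in driver memory.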


On Wed, Feb 9, 2022 at 09:30, Bitfox wrote:

> Hi
>
> I am not sure I understand the whole situation.
> But if you want to do this in Scala, I think you could use a regex to
> match and capture the keywords.
> Here is one I wrote that you can adapt to your needs.
>
> import scala.io.Source
> import scala.collection.mutable.ArrayBuffer
>
> // Lines with two or more '#' separators go to list1,
> // lines with exactly one '#' go to list2.
> val list1 = ArrayBuffer[(String, String, String)]()
> val list2 = ArrayBuffer[(String, String)]()
>
> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
> val patt2 = """^(.*)#([^#]*)$""".r
>
> val file = "1.txt"
> val lines = Source.fromFile(file).getLines()
>
> for (x <- lines) {
>   x match {
>     case patt1(k, v, z) => list1 += ((k, v, z))
>     case patt2(k, v)    => list2 += ((k, v))
>     case _              => println("no match")
>   }
> }
>
>
>
> Now list1 and list2 hold the elements you wanted, and you can convert
> them to a DataFrame easily.
>
>
> Thanks.
>
> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa 
> wrote:
>
>> Hello
>>
>>
>> Yes, I can open this block as CSV with a # delimiter, but there is
>> another block that is not in CSV format.
>>
>> That block looks more like key-value pairs.
>>
>> We have two different layouts in the same file. This is the “problem”.
>>
>> Thanks for your time.
>>
>>
>>
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>
>>
>> On 9 Feb 2022, at 00:58, Bitfox  wrote:
>>
>> Hello
>>
>> You can treat it as a CSV file and load it from Spark:
>>
>> >>> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", "#").load(csv_file)
>> >>> df.show()
>> +--------------------+-------------------+-----------------+
>> |               Plano|Código Beneficiário|Nome Beneficiário|
>> +--------------------+-------------------+-----------------+
>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>> +--------------------+-------------------+-----------------+
>>
>>
>> cat csv_file:
>>
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>
>>
>> Regards
>>
>>
>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
>> wrote:
>>
>>> Hi
>>> I have to transform unstructured text into a DataFrame.
>>> Could anyone please help with the Scala code?
>>>
>>> The DataFrame needs these columns:
>>>
>>> operadora filial unidade contrato empresa plano codigo_beneficiario
>>> nome_beneficiario
>>>
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>>
>>> Contrato#898011000 - FUNDACAO GERDAU
>>> Empresa#FUNDACAO GERDAU
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>>
>>>
>>

