Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format: test.ft.txt.bz2.zip. So it is a text file that is compressed with bz2 and then zipped. Now I would like to do all these operations in PySpark
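A minimal PySpark-era sketch of the usual approach: Spark (via Hadoop) decompresses .bz2 transparently, but not .zip, so strip the outer zip layer on the driver first. The file names come from the post; the /tmp path is an assumption.

    # Unzip on the driver, then let Spark handle the bz2 layer natively.
    import zipfile
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    with zipfile.ZipFile("test.ft.txt.bz2.zip") as zf:
        zf.extract("test.ft.txt.bz2", path="/tmp")  # removes the outer zip layer

    # spark.read.text / sc.textFile read .bz2 transparently via the Hadoop codec
    df = spark.read.text("file:///tmp/test.ft.txt.bz2")
    df.show(5, truncate=False)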

How to read text files with GBK encoding in the spark core

2023-04-30 Thread lianyou1...@126.com
Hello all, Is there any way to use the pyspark core to read some text files with GBK encoding? Although pyspark sql has an option to set the encoding, these text files are not in a structured format. Any advice is appreciated. Thank you lianyou Li
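One workaround sketch, assuming each file fits comfortably in memory per record: the built-in text reader assumes UTF-8, but sc.binaryFiles hands back raw bytes that can be decoded as GBK in Python (the path is a placeholder).

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    lines = (sc.binaryFiles("hdfs:///data/gbk_files/*")
               .values()                            # drop the file name, keep content
               .map(lambda raw: raw.decode("gbk"))  # bytes -> str with the GBK codec
               .flatMap(lambda text: text.splitlines()))
    print(lines.take(3))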

Loading a text file

2022-03-14 Thread Hinko Kocevar
I have a standalone Spark 3.2.0 cluster with two workers started on PC_A and want to run a pyspark job from PC_B. The job wants to load a text file. I keep getting file-not-found error messages when I execute the job. Folder/file "/home/bddev/parrot/words.txt" exists on PC_B but n
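The usual cause is that sc.textFile paths are resolved on the executors (PC_A), not the submitting machine (PC_B). A minimal sketch of the two standard fixes, with hypothetical paths:

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Fix 1: put the file on storage every node can reach (HDFS/S3/NFS)
    words = sc.textFile("hdfs:///data/words.txt")

    # Fix 2: ship a small local file to all nodes, then open it inside tasks
    sc.addFile("/home/bddev/parrot/words.txt")
    shipped = sc.parallelize([0]).flatMap(
        lambda _: open(SparkFiles.get("words.txt")).read().splitlines())
    print(shipped.take(5))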

Re: Help With unstructured text file with spark scala

2022-02-25 Thread Danilo Sousa
_file) >>> >>> df.show() >>> +-----+-------------------+-----------------+ >>> | Plano|Código Beneficiário|Nome Beneficiário| >>> +-----+-------------------+-----------------+ >>> |58693 - NAC

Re: Help With unstructured text file with spark scala

2022-02-21 Thread Danilo Sousa
5751388| Julia Silva| >> +-----+-------------------+-----------------+ >> >> >> cat csv_file: >> >> Plano#Código Beneficiário#Nome Beneficiário >> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva >> 58693 - NACIONAL R COPART PJ

restoring SQL text from logical plan

2022-02-16 Thread Wang Cheng
to sql to recommend to the user. I'm wondering, is there any function that converts a logical plan back to SQL text?

Re: Help With unstructured text file with spark scala

2022-02-13 Thread Rafael Mendes
Jose Silva| >> |58693 - NACIONAL ...| 65751388| Joana Silva| >> |58693 - NACIONAL ...| 65751353| Felipe Silva| >> |58693 - NACIONAL ...| 65751388| Julia Silva| >> +-----+-------------------+-----------------+

Re: Help With unstructured text file with spark scala

2022-02-09 Thread Bitfox
> Plano#Código Beneficiário#Nome Beneficiário > 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva > 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva > 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva > > 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva >

Re: Help With unstructured text file with spark scala

2022-02-09 Thread Danilo Sousa
open attachments unless you can confirm the sender and know > the content is safe. > > > >Hi >I have to transform unstructured text to a dataframe. >Could anyone please help with Scala code? > >Dataframe needed as: > >operadora filial un

Re: Help With unstructured text file with spark scala

2022-02-09 Thread Danilo Sousa
#065751353#Jose Silva > 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva > 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva > 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva > > > Regards > > > On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa <mail

Re: Help With unstructured text file with spark scala

2022-02-08 Thread Bitfox
va 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva Regards On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa wrote: > Hi > I have to transform unstructured text to dataframe. &g

Re: Help With unstructured text file with spark scala

2022-02-08 Thread Lalwani, Jayesh
, 11:50 AM, "Danilo Sousa" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi I have to transform unstructured text to dataframe. Coul

Help With unstructured text file with spark scala

2022-02-08 Thread Danilo Sousa
Hi I have to transform unstructured text to a dataframe. Could anyone please help with Scala code? Dataframe needed as: operadora filial unidade contrato empresa plano codigo_beneficiario nome_beneficiario Relação de Beneficiários Ativos e Excluídos Carteira em#27/12/2019##Todos os Beneficiários
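A minimal PySpark rendering of the approach the replies converge on: the records are '#'-delimited, so read as plain text, drop the report/header lines, and split. The file name and the data-row filter are assumptions, not the poster's code.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    raw = spark.sparkContext.textFile("report.txt")
    rows = (raw.map(lambda line: line.split("#"))
               .filter(lambda f: len(f) == 3 and f[1].isdigit()))  # keep data rows only

    df = rows.toDF(["plano", "codigo_beneficiario", "nome_beneficiario"])
    df.show(truncate=False)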

Re: df.show() to text file

2021-12-25 Thread papadopo
/12/21 03:49 (GMT+02:00) To: bit...@bitfox.top Cc: User Subject: Re: df.show() to text file You can redirect the stdout of your program I guess but show is for display, not saving data. Use df.write methods for that. On Fri, Dec 24, 2021, 7:02 PM wrote: Hello list, spark newbie here :0 How

Re: df.show() to text file

2021-12-24 Thread Sean Owen
You can redirect the stdout of your program I guess but show is for display, not saving data. Use df.write methods for that. On Fri, Dec 24, 2021, 7:02 PM wrote: > Hello list, > > spark newbie here :0 > How can I write the df.show() result to a text file in the system? > I r

df.show() to text file

2021-12-24 Thread bitfox
Hello list, spark newbie here :0 How can I write the df.show() result to a text file in the system? I run with pyspark, not the python client programming. Thanks.
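A minimal sketch of the redirect idea from the replies: df.show() prints to stdout in PySpark, so capture stdout around the call (out.txt is a placeholder; for real exports prefer df.write).

    import contextlib
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)  # stand-in for your DataFrame

    with open("out.txt", "w") as f, contextlib.redirect_stdout(f):
        df.show(100, truncate=False)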

Re: Reading the last line of each file in a set of text files

2021-08-03 Thread Artemis User
PM, Sayeh Roshan wrote: Hi users, Does anyone here have experience writing Spark code that reads just the last line of each text file in a directory, s3 bucket, etc? I am looking for a solution that doesn't require reading the whole file. I basically wonder whether you can create a data frame

Reading the last line of each file in a set of text files

2021-08-02 Thread Sayeh Roshan
Hi users, Does anyone here have experience writing Spark code that reads just the last line of each text file in a directory, S3 bucket, etc.? I am looking for a solution that doesn't require reading the whole file. I basically wonder whether you can create a data frame/RDD using file seek
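A minimal sketch under stated assumptions: list the paths on the driver and fetch each tail with a seek, so no file is read in full. Shown for the local filesystem; for S3/HDFS the same idea works with ranged reads, which are not shown here.

    import glob
    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def last_line(path, tail=4096):
        # assumes the last line fits in `tail` bytes and the file is UTF-8
        with open(path, "rb") as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - tail))
            return f.read().decode("utf-8").rstrip("\n").rsplit("\n", 1)[-1]

    paths = glob.glob("/data/logs/*.txt")
    df = spark.createDataFrame([(p, last_line(p)) for p in paths],
                               ["file", "last_line"])
    df.show(truncate=False)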

Re: CSV data source : Garbled Japanese text and handling multilines

2020-05-20 Thread ZHANG Wei
May I get the CSV file's encoding, which can be checked by `file` command? -- Cheers, -z On Tue, 19 May 2020 09:24:24 +0900 Ashika Umagiliya wrote: > In my Spark job (spark 2.4.1) , I am reading CSV files on S3.These files > contain Japanese characters.Also they can have ^M character (u000D)

CSV data source : Garbled Japanese text and handling multilines

2020-05-18 Thread Ashika Umagiliya
In my Spark job (Spark 2.4.1), I am reading CSV files on S3. These files contain Japanese characters. Also they can have the ^M character (U+000D), so I need to parse them as multiline. First I used the following code to read the CSV files: implicit class DataFrameReadImplicits (dataFrameReader:
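A minimal sketch of the two csv options in play here; the Shift_JIS charset is an assumption (use whatever `file` reports for the actual files), and note that in some 2.x versions the encoding option and multiLine did not combine cleanly.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
            .option("header", "true")
            .option("encoding", "Shift_JIS")  # match the file's actual charset
            .option("multiLine", "true")      # tolerate embedded ^M / newlines
            .csv("s3a://bucket/japanese/*.csv"))
    df.show(5, truncate=False)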

Re: Read text file row by row and apply conditions

2019-09-30 Thread hemant singh
nds, > > I am new to spark. can you please help me to read the below text file > using spark and scala. > > Sample data > > bret|lee|A|12345|ae545|gfddfg|86786786 > 142343345||D|ae342 > 67567|6|U|aadfsd|34k4|84304|020|sdnfsdfn|3243|benej|32432|jsfsdf|3423 > 675

Re: Read text file row by row and apply conditions

2019-09-30 Thread vaquar khan
kadiyala wrote: > dear friends, > > I am new to spark. can you please help me to read the below text file > using spark and scala. > > Sample data > > bret|lee|A|12345|ae545|gfddfg|86786786 > 142343345||D|ae342 > 67567|6|U|aadfsd|34k4|84304|020|sdnfsdfn|3243|b

Read text file row by row and apply conditions

2019-09-29 Thread swetha kadiyala
Dear friends, I am new to Spark. Can you please help me read the text file below using Spark and Scala? Sample data: bret|lee|A|12345|ae545|gfddfg|86786786 142343345||D|ae342 67567|6|U|aadfsd|34k4|84304|020|sdnfsdfn|3243|benej|32432|jsfsdf|3423 67564|67747|U|aad434|3435|843454|203
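A minimal sketch for this sample: the rows are pipe-delimited with an A/D/U flag in the third field, so split and branch on it. What each condition should do is not stated in the post, so the three branches are placeholders.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rows = sc.textFile("sample.txt").map(lambda line: line.split("|"))

    added   = rows.filter(lambda f: len(f) > 2 and f[2] == "A")
    deleted = rows.filter(lambda f: len(f) > 2 and f[2] == "D")
    updated = rows.filter(lambda f: len(f) > 2 and f[2] == "U")
    print(added.count(), deleted.count(), updated.count())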

Phrase Search using Apache Spark in huge amount of text in files

2019-05-28 Thread Sandeep Giri
Dear Spark Users, If you want to search for a list of phrases (approx. 10,000, each 1 to 6 words long) in a large amount of text (approximately 10GB), how do you go about it? I ended up writing a small RDD-based library: https://github.com/cloudxlab/phrasesearch I would like to get feedback
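A naive baseline sketch of the problem, not the cloudxlab library itself: broadcast the phrase list once and scan each line for plain substring hits.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    phrases = ["machine learning", "apache spark"]  # stand-in for ~10,000 phrases
    bc = sc.broadcast(phrases)

    hits = (sc.textFile("hdfs:///corpus/*.txt")
              .flatMap(lambda line: [(p, 1) for p in bc.value if p in line])
              .reduceByKey(lambda a, b: a + b))     # phrase -> match count
    print(hits.collect())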

RE: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread email
to print DataFrame.show(100) to text file at HDFS Use .limit on the dataframe followed by .write On Apr 14, 2019, at 5:10 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Nuthan, Thank you for reply. the solution proposed will give everything. for me is like one Datafram

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread Brandon Geise
Use .limit on the dataframe followed by .write On Apr 14, 2019, at 5:10 AM, Chetan Khatri wrote: >Nuthan, > >Thank you for reply. the solution proposed will give everything. for me >is >like one Dataframe show(100) in 3000 lines of Scala Spark code. >However, yarn logs --applicationId

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread Chetan Khatri
Nuthan, Thank you for the reply. The solution proposed will give everything; for me it is like one DataFrame show(100) in 3000 lines of Scala Spark code. However, yarn logs --applicationId > 1.log also gives all stdout and stderr. Thanks On Sun, Apr 14, 2019 at 10:30 AM Nuthan Reddy wrote: > Hi

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Nuthan Reddy
Hi Chetan, You can use spark-submit showDF.py | hadoop fs -put - showDF.txt showDF.py: from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Write stdout").getOrCreate() spark.sparkContext.setLogLevel("OFF") spark.table("").show(100, truncate=False) But is there any

How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Chetan Khatri
Hello Users, In Spark, when I have a DataFrame and do .show(100), I want to save the printed output as-is to a txt file in HDFS. How can I do this? Thanks
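A minimal sketch of the .limit-then-.write answer from this thread, flattening the columns to one tab-separated string so the text writer accepts them (the table name and HDFS path are placeholders).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat_ws

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("mytable")

    (df.limit(100)
       .select(concat_ws("\t", *[col(c).cast("string") for c in df.columns]))
       .write.mode("overwrite")
       .text("hdfs:///tmp/show100"))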

Re: Spark2: Deciphering saving text file name

2019-04-09 Thread Jason Nerothin
Hi Subash, Short answer: It’s effectively random. Longer answer: In general the DataFrameWriter expects to be receiving data from multiple partitions. Let’s say you were writing to ORC instead of text. In this case, even when you specify the output path, the writer creates a directory

Spark2: Deciphering saving text file name

2019-04-08 Thread Subash Prabakar
Hi, While saving a text file in Spark2, I see an encoded/hash value attached in the part files along with the part number. I am curious to know what that value is about. Example: ds.write.mode(SaveMode.Overwrite).option("compression","gzip").text(path) Produces, part-1-1e4

Encoding issue reading text file

2018-10-18 Thread Masf
Hi everyone, I'm trying to read a text file with UTF-16LE but I'm getting weird characters like this: �� W h e n My code is this one: sparkSession .read .format("text") .option("charset", "UTF-16LE") .load("textfile.txt")
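A workaround sketch, assuming the text source is ignoring the charset option in this version: pull raw bytes with binaryFiles and decode UTF-16LE by hand (whole file per record, so fine for modest sizes).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lines = (sc.binaryFiles("textfile.txt")
               .values()
               .map(lambda raw: raw.decode("utf-16-le"))  # explicit decode
               .flatMap(lambda text: text.splitlines()))
    df = lines.map(lambda l: (l,)).toDF(["value"])
    df.show(5, truncate=False)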

Re: Text from pdf spark

2018-09-28 Thread Joel D
ent from my iPhone > > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf files in hdfs using pdfBox. > > However it throws an error: > > "Exception in thread "main" org.apache.spark.SparkException: ... > > java.io.FileNo

Re: Text from pdf spark

2018-09-28 Thread kathleen li
The error message is "file not found". Are you able to use the following command line to access the file with the user you submitted the job as? hdfs dfs -ls /tmp/sample.pdf Sent from my iPhone > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf file

Text from pdf spark

2018-09-28 Thread Joel D
I'm trying to extract text from pdf files in hdfs using pdfBox. However it throws an error: "Exception in thread "main" org.apache.spark.SparkException: ... java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)" What am I missing?
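A sketch of the usual fix: PDF libraries open local paths, not HDFS URIs, so read the bytes through Spark and parse them in memory. Shown with Python's pypdf as an assumption; the original used Java PDFBox, where PDDocument.load(byte[]) plays the same role.

    import io
    from pyspark import SparkContext
    from pypdf import PdfReader  # pip install pypdf

    sc = SparkContext.getOrCreate()

    def pdf_to_text(raw_bytes):
        reader = PdfReader(io.BytesIO(raw_bytes))  # parse in memory, no local path
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    texts = sc.binaryFiles("hdfs:///tmp/*.pdf").mapValues(pdf_to_text)
    print(texts.keys().collect())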

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-21 Thread vermanurag
Try to_json on the vector column. That should do it.

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi, I was hoping that there is a method for casting a vector into String (instead of writing my own UDF), so that it can then be serialized into a csv/text file. Best regards, Mina On Tue, Feb 20, 2018 at 6:52 PM, vermanurag <anurag.ve...@fnmathlogic.com> wrote: > If your dataframe has colu

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi Snehasish, Unfortunately, none of the solutions worked. Regards, Mina On Tue, Feb 20, 2018 at 5:12 PM, SNEHASISH DUTTA <info.snehas...@gmail.com> wrote: > Hi Mina, > > Even text won't work you may try this df.coalesce(1).write.option("h > eader","true

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread vermanurag
If your dataframe has column types like vector then you cannot save as csv/text, as there is no direct equivalent supported by flat formats like csv/text. You may need to convert the column type appropriately (e.g. convert the incompatible column to StringType) before saving the output as csv

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, Even text won't work; you may try this: df.coalesce(1).write.option("header","true").mode("overwrite").save("output",format=text) Else convert to an rdd and use saveAsTextFile Regards, Snehasish On Wed, Feb 21, 2018 at 3:38 AM, SNEHASISH DUTTA

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, This might work then: df.coalesce(1).write.option("header","true").mode("overwrite").text("output") Regards, Snehasish On Wed, Feb 21, 2018 at 3:21 AM, Mina Aslani <aslanim...@gmail.com> wrote: > Hi Snehasish, > > Using df.co

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
3 AM, Mina Aslani <aslanim...@gmail.com> wrote: > >> Hi, >> >> I would like to serialize a dataframe with vector values into a text/csv >> in pyspark. >> >> Using the line below, I can write the dataframe (e.g. df) as parquet, however >> I cannot open it in excel/as text. >> df.coalesce(1).write.option("header","true").mode("overwrite").save("output") >> >> Best regards, >> Mina >> >> >

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, This might help df.coalesce(1).write.option("header","true").mode("overwrite").csv("output") Regards, Snehasish On Wed, Feb 21, 2018 at 1:53 AM, Mina Aslani <aslanim...@gmail.com> wrote: > Hi, > > I would like to serialize a da

Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi, I would like to serialize a dataframe with vector values into a text/csv in pyspark. Using the line below, I can write the dataframe (e.g. df) as parquet, however I cannot open it in excel/as text. df.coalesce(1).write.option("header","true").mode("overwrite").save("output") Best regards, Mina

Write a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi, I would like to write a dataframe with vector values into a text/csv file. Using the line below, I can write it as parquet, however I cannot open it in excel/as text. df.coalesce(1).write.option("header","true").mode("overwrite").save("stage-s3logs-model&
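A minimal sketch of the conversion the replies recommend: cast the vector column to a string with a UDF before writing csv (the column names and output path are placeholders).

    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, Vectors.dense([1.0, 2.0])), (1, Vectors.dense([3.0, 4.0]))],
        ["id", "features"])

    # flatten the vector into a plain string so csv can carry it
    vec_to_str = udf(lambda v: ",".join(str(x) for x in v.toArray()), StringType())

    (df.withColumn("features", vec_to_str("features"))
       .coalesce(1)
       .write.option("header", "true").mode("overwrite")
       .csv("output"))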

Re: text processing in spark (Spark job stucks for several minutes)

2017-10-26 Thread Jörn Franke
Please provide source code and exceptions that are in executor and/or driver log. > On 26. Oct 2017, at 08:42, Donni Khan <prince.don...@googlemail.com> wrote: > > Hi, > I'm applying preprocessing methods on big data of text by using spark-Java. I > created my own NLP pip

text processing in spark (Spark job stucks for several minutes)

2017-10-26 Thread Donni Khan
Hi, I'm applying preprocessing methods to big text data using Spark-Java. I created my own NLP pipeline as normal Java code and call it in the map function like this: MyRDD.map(call nlp pipeline for each row) I run my job on a cluster of 14 machines (32 cores and about 140G each). The job

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2017-10-20 Thread lmk
Trying to improve the old solution. Do we have a better text classifier now in Spark MLlib? Regards, lmk
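For reference, a minimal sketch of the standard DataFrame-based route (spark.ml rather than the old mllib package): Tokenizer -> HashingTF -> IDF -> LogisticRegression; the toy training data is made up.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [("spark is great", 1.0), ("terribly slow job", 0.0)], ["text", "label"])

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf"),
        IDF(inputCol="tf", outputCol="features"),
        LogisticRegression(maxIter=10)])
    model = pipeline.fit(train)
    model.transform(train).select("text", "prediction").show()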

Exception: JDK-8154035 using Whole text files api

2017-07-05 Thread Reth RM
Hi, Using sc.wholeTextFiles to read a warc file (example file here). Spark is reporting an error with the stack trace pasted here: https://pastebin.com/qfmM2eKk Looks like it's the same as the bug reported here:

Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

2017-06-09 Thread Barona, Ricardo
: "Manjunath, Kiran" <kiman...@akamai.com> Date: Friday, June 9, 2017 at 1:47 PM To: "Barona, Ricardo" <ricardo.bar...@intel.com>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) dup

Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

2017-06-09 Thread Manjunath, Kiran
"user@spark.apache.org" <user@spark.apache.org> Subject: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows In Spark 1.6.0 I'm having an issue with saveAsText and write.mode.text where I have a data frame with 1M+ rows and then I do: dataFrame.limit(500).map(_.mkString("\t")).toDF("row

RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

2017-06-09 Thread Barona, Ricardo
In Spark 1.6.0 I'm having an issue with saveAsText and write.mode.text where I have a data frame with 1M+ rows and then I do: dataFrame.limit(500).map(_.mkString("\t")).toDF("row").write.mode(SaveMode.Overwrite).text("myHDFSFolder/results") then when I check the results file, I see 900

Re: Adding header to an rdd before saving to text file

2017-06-06 Thread Irving Duran
Not the best option, but I've done this before. If you know the column structure you could manually write them to the file before exporting. On Tue, Jun 6, 2017 at 12:39 AM 颜发才(Yan Facai) wrote: > Hi, upendra. > It will be easier to use DataFrame to read/save csv file with

Re: Adding header to an rdd before saving to text file

2017-06-05 Thread Yan Facai
Hi, upendra. It will be easier to use DataFrame to read/save csv file with header, if you'd like. On Tue, Jun 6, 2017 at 5:15 AM, upendra 1991 wrote: > I am reading a CSV(file has headers header 1st,header2) and generating > rdd, > After few transformations I

Adding header to an rdd before saving to text file

2017-06-05 Thread upendra 1991
I am reading a CSV (the file has headers: header1, header2) and generating an rdd. After a few transformations I create an rdd and finally write it to a txt file. What's the best way to add the header from the source file into the rdd and have it available as a header in the new file, i.e., when I transform the rdd
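A minimal sketch of the DataFrame route suggested in the replies: let the csv reader consume the header and the csv writer re-emit it, instead of stitching a header line into the RDD by hand (paths are placeholders).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.option("header", "true").csv("input.csv")
    transformed = df  # ...your transformations here...
    transformed.write.option("header", "true").mode("overwrite").csv("output")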

Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread Sonal Goyal
> > I am converting my Java based NLP parser to execute it on my Spark > > cluster. I know that Spark can read multiple text files from a directory > > and convert into RDDs for further processing. My input data is not only > in > > text files, but in a multitude o

Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread docdwarf
tesmai4 wrote > I am converting my Java based NLP parser to execute it on my Spark > cluster. I know that Spark can read multiple text files from a directory > and convert into RDDs for further processing. My input data is not only in > text files, but in a multitude of different

Reading PDF/text/word file efficiently with Spark

2017-05-19 Thread tesm...@gmail.com
Hi, I am doing NLP (Natural Language Processing) processing on my data. The data is in form of files that can be of type PDF/Text/Word/HTML. These files are stored in a directory structure on my local disk, even nested directories. My stand alone Java based NLP parser can read input files, extract

Reading PDF/text/word file efficiently with Spark

2017-05-19 Thread tesmai4
Hi,I am doing NLP (Natural Language Processing) processing on my data. The data is in form of files that can be of type PDF/Text/Word/HTML. These files are stored in a directory structure on my local disk, even nested directories. My stand alone Java based NLP parser can read input files, extract

Re: Returning DataFrame for text file

2017-04-07 Thread Jacek Laskowski
Hi, What's the alternative? Dataset? You've got textFile then. It's an older API from the ages when Dataset was merely experimental. Jacek On 29 Mar 2017 8:58 p.m., "George Obama" wrote: > Hi, > > I saw that the API, either R or Scala, we are returning DataFrame for >

Re: Returning DataFrame for text file

2017-04-06 Thread Yan Facai
SparkSession.read returns a DataFrameReader. DataFrameReader supports a series of formats, such as csv, json, and text as you mentioned. Check the API to find more details: + http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession + http://spark.apache.org/docs/latest

Returning DataFrame for text file

2017-03-29 Thread George Obama
Hi, I saw that the API, either R or Scala, returns a DataFrame for sparkSession.read.text(). What's the rationale behind this? Regards, George

Re: Not able to remove header from a text file while creating a data frame .

2017-03-04 Thread KhajaAsmath Mohammed
ter", ",").load("data/datapoint_raw/BatteryVoltage.csv" On Sat, Mar 4, 2017 at 8:42 AM, <psw...@in.imshealth.com> wrote: > Hi All, > > > > I am reading a text file to create a dataframe . While I am trying to > exclude header form the tex

Not able to remove header from a text file while creating a data frame .

2017-03-04 Thread PSwain
Hi All, I am reading a text file to create a dataframe. While I am trying to exclude the header from the text file, I am not able to do it. Now my concern is how to know all the options that are available when reading from a source. I checked the API; there the arguments in option
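A minimal sketch of the option used in the reply: with the csv source, header="true" both names the columns and drops that first line from the data (path and delimiter follow the thread).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
            .option("header", "true")      # consume the first line as header
            .option("delimiter", ",")
            .csv("data/datapoint_raw/BatteryVoltage.csv"))
    df.show(5)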

Re: JavaRDD text metadata (file name) findings

2017-02-01 Thread neil90
You can use https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles(java.lang.String) but it will return an RDD of (filename, content) pairs.

Re: JavaRDD text metadata (file name) findings

2017-01-31 Thread Hyukjin Kwon
Hi, Would it be possible to switch to the text datasource with the input_file_name function? Thanks. On 1 Feb 2017 3:58 a.m., "Manohar753" <manohar.re...@happiestminds.com> wrote: Hi All, my spark job is reading data from a folder having different files with the same structured data.

JavaRDD text metadata (file name) findings

2017-01-31 Thread Manohar753
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-text-matadata-file-name-findings-tp28353.html
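A minimal sketch of the input_file_name suggestion from this thread: the text source plus input_file_name() tags each row with its source file, without wholeTextFiles' whole-file-per-record cost (the folder path is a placeholder).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.text("hdfs:///data/folder/*")
            .withColumn("file", input_file_name()))
    df.show(5, truncate=False)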

Re: Text

2017-01-27 Thread Jörn Franke
> >> On 27 Jan 2017, at 10:44, Soheila S. <soheila...@gmail.com> wrote: >> >> Hi All, >> I read a test file using sparkContext.textfile(filename) and assign it to an >> RDD and process the RDD (replace some words) and finally write it to a text &

Re: Text

2017-01-27 Thread Jörn Franke
with the TextInputFormat. > On 27 Jan 2017, at 10:44, Soheila S. <soheila...@gmail.com> wrote: > > Hi All, > I read a test file using sparkContext.textfile(filename) and assign it to an > RDD and process the RDD (replace some words) and finally write it to a text > file using r

Re: Text

2017-01-27 Thread ayan guha
; I read a test file using sparkContext.textfile(filename) and assign it to > an RDD and process the RDD (replace some words) and finally write it to > a text file using rdd.saveAsTextFile(output). > Is there any way to be sure the order of the sentences will not be > changed? I need to ha

Re: Text

2017-01-27 Thread Md. Rezaul Karim
lename) and assign it to > an RDD and process the RDD (replace some words) and finally write it to > a text file using rdd.saveAsTextFile(output). > Is there any way to be sure the order of the sentences will not be > changed? I need to have the same text with some corrected words. > > thanks! > > Soheila >

Text

2017-01-27 Thread Soheila S.
Hi All, I read a test file using sparkContext.textFile(filename), assign it to an RDD, process the RDD (replace some words), and finally write it to a text file using rdd.saveAsTextFile(output). Is there any way to be sure the order of the sentences will not be changed? I need to have
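A minimal sketch of one common answer: remember each line's original position with zipWithIndex, transform, then sort on that index before saving (the word replacement is a stand-in).

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    indexed = sc.textFile("input.txt").zipWithIndex()   # (line, position)
    fixed = indexed.map(lambda li: (li[1], li[0].replace("teh", "the")))
    fixed.sortByKey().values().saveAsTextFile("output")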

Help regarding reading text file within rdd operations

2016-10-25 Thread Rohit Verma
Hi Team, Please help me with a scenario; I tried on StackOverflow but got no response, so excuse me for mailing this thread. I have two string lists containing text file paths, List a and List b. I want to take the cartesian product of lists a and b to achieve a cartesian dataframe comparison. The way I am

Re: Reading the most recent text files created by Spark streaming

2016-09-15 Thread Mich Talebzadeh
l.com> > wrote: > > Hi, > > I have a Spark streaming that reads messages/prices from Kafka and writes > it as text file to HDFS. > > This is pretty efficient. Its only function is to persist the incoming > messages to HDFS. > > This is what it does >

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
patnaik.sa...@gmail.com> wrote: > > These are not csv files, utf8 files with a specific delimiter. > I tried this out with a file (3 GB): > > myDF.write.json("output/myJson") > Time taken- 60 secs approximately. > > myDF.rdd.repartition(1).saveAsTextFi

Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
17:28, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi, > > I have a Spark streaming that reads messages/prices from Kafka and writes it > as text file to HDFS. > > This is pretty efficient. Its only function is to persist the incoming > messages t

Reading the most recent text files created by Spark streaming

2016-09-14 Thread Mich Talebzadeh
Hi, I have a Spark streaming job that reads messages/prices from Kafka and writes them as text files to HDFS. This is pretty efficient. Its only function is to persist the incoming messages to HDFS. This is what it does: dstream.foreachRDD { pricesRDD => val x= pricesRDD.co
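A sketch of one way to pick the newest output, reaching through py4j to the Hadoop FileSystem API; this leans on private attributes (_jvm, _jsc), so treat it as an assumption rather than a stable PySpark API.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()

    path = jvm.org.apache.hadoop.fs.Path("hdfs:///prices/")
    fs = path.getFileSystem(conf)
    # rank the streaming output files by modification time
    newest = max(fs.listStatus(path), key=lambda s: s.getModificationTime())
    df = spark.read.text(newest.getPath().toString())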

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
> > myDF.write.json("output/myJson") > Time taken- 60 secs approximately. > > myDF.rdd.repartition(1).saveAsTextFile("output/text") > Time taken 160 secs > > That is where I am concerned, the time to write a text file compared to json > grows ex

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
As I understand it, you cannot deliver a json file downstream as they want text format. If it is batch processing, what is the window of delivery within the SLA? Writing a 3GB file in 160 seconds means it takes > 50 seconds to write 1 Gig, which looks like a long time to me. Even taking one min

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
sing Spark SQL and Dataframes. This >> application has a bunch of file joins and there are intermediate points >> where I need to drop a file for downstream applications to consume. >> The problem is all these downstream applications are still on legacy, so >> they still requ

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
These are not csv files, but utf8 files with a specific delimiter. I tried this out with a file (3 GB): myDF.write.json("output/myJson") Time taken: 60 secs approximately. myDF.rdd.repartition(1).saveAsTextFile("output/text") Time taken: 160 secs. That is where I am concerned; the

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
roblem is all these downstream applications are still on >legacy, so they still require us to drop them a text file.As you all must >be knowing Dataframe stores the data in columnar format internally. > > Only way I found out how to do this and which looks awfully slow is this:

Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
on legacy, so they still require us to drop them a text file. As you all must know, a Dataframe stores data in columnar format internally. The only way I found to do this, and which looks awfully slow, is this: myDF=sc.textFile("inputpath").toDF() myDF.rdd.repartition(1).save
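A minimal sketch of the advice in this thread: skip repartition(1), which funnels the whole 3 GB through a single task, and let the csv writer emit the delimiter across many part files (Spark 2.x syntax; on 1.6 the spark-csv package provides the same via format("com.databricks.spark.csv")).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    myDF = spark.read.text("inputpath")

    # parallel write; merge part files downstream only if a single file is required
    myDF.write.option("sep", "|").mode("overwrite").csv("output/text")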

Re: Splitting columns from a text file

2016-09-05 Thread Gourav Sengupta
Just use Spark CSV; all other ways of splitting are just reinventing the wheel and a magnanimous waste of time. Regards, Gourav On Mon, Sep 5, 2016 at 1:48 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: > Hi, > > I have a text file as be

Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
ashok34...@yahoo.com> wrote: > Thanks everyone. > > I am not skilled like you gentlemen > > This is what I did > > 1) Read the text file > > val textFile = sc.textFile("/tmp/myfile.txt") > > 2) That produces an RDD of String. > > 3) Create a DF af

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Thanks everyone. I am not skilled like you gentlemen. This is what I did: 1) Read the text file val textFile = sc.textFile("/tmp/myfile.txt") 2) That produces an RDD of String. 3) Create a DF after splitting the file into an Array val df = textFile.map(line => line.split(","

Re: Splitting columns from a text file

2016-09-05 Thread ayan guha
tics.com>> wrote: > > > Basic error, you get back an RDD on transformations like map. > sc.textFile("filename").map(x => x.split(",") > > On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: > > Hi, >

Re: Splitting columns from a text file

2016-09-05 Thread Fridtjof Sander
016, 13:51, Somasundaram Sekar <somasundar.sekar@ tigeranalytics.com <mailto:somasundar.se...@tigeranalytics.com>> wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",") On 5 Sep 201

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
uot;,") On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: Hi, I have a text file as below that I read in 74,20160905-133143,98. 1121806912882759414875,20160905-133143,49. 5277699881591680774276,20160905-133143,56. 0802995712398098455677,20160905-133143,4

Re: Splitting columns from a text file

2016-09-05 Thread ayan guha
ics.com> wrote: > > > Basic error, you get back an RDD on transformations like map. > sc.textFile("filename").map(x => x.split(",") > > On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: > > Hi, > > I have a

Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
kar@ > tigeranalytics.com> wrote: > > > Basic error, you get back an RDD on transformations like map. > sc.textFile("filename").map(x => x.split(",") > > On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: > > Hi

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
: Basic error, you get back an RDD on transformations like map.sc.textFile("filename").map(x => x.split(",") On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: Hi, I have a text file as below that I read in 74,20160905-

Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Basic error: you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: > Hi, > > I have a text file as below that I read i

Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Hi, I have a text file as below that I read in 74,20160905-133143,98.1121806912882759414875,20160905-133143,49.5277699881591680774276,20160905-133143,56.0802995712398098455677,20160905-133143,46.636895265444075228,20160905-133143,84.8822714116440218155179,20160905
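A minimal PySpark rendering of the thread's answers: map/split yields an RDD of arrays that still needs toDF, or spark.read.csv does the whole job (column names are assumptions based on the sample).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.sparkContext.textFile("/tmp/myfile.txt")
            .map(lambda line: line.split(","))
            .toDF(["id", "timestamp", "value"]))
    df.show(5)

    # or simply:
    df2 = spark.read.csv("/tmp/myfile.txt").toDF("id", "timestamp", "value")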

Re: Breaking down text String into Array elements

2016-08-23 Thread Mich Talebzadeh
(_ => chars(Random.nextInt(chars.length))).mkString spark.udf.register("randomString", randomString(_:String, _:Int)) case class columns (col1: Int, col2: String) //val chars = ('a' to 'z') ++ ('A' to 'Z') ++ ('0' to '9') ++ ("-!£$") val chars = ('a' to 'z') ++ ('A' to 'Z') v

Re: Breaking down text String into Array elements

2016-08-23 Thread RK Aduri
That’s because of this: scala> val text = Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM&quo

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
> How about something like > > scala> val text = (1 to 10).map(i => (i.toString, > random_string(chars.mkString(""), 10))).toArray > > text: Array[(String, String)] = Array((1,FBECDoOoAC), (2,wvAyZsMZnt), > (3,KgnwObOFEG), (4,tAZPRodrgP), (5,uSgrqyZGuc),

Re: Breaking down text String into Array elements

2016-08-23 Thread Mich Talebzadeh
val chars = ('a' to 'z') ++ ('A' to 'Z') var text = "" val comma = "," val terminator = "))" var random_char = "" for (i <- 1 to 10) { random_char = random_string(chars.mkString(""), 10) if (i < 10) {text = text + """("

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
what is "text"? i.e. what is the "val text = ..." definition? If text is a String itself then indeed sc.parallelize(Array(text)) is doing the correct thing in this case. On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > I am sure som

Breaking down text String into Array elements

2016-08-23 Thread Mich Talebzadeh
I am sure someone knows this :) Created a dynamic text string which has format scala> println(text) (1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,
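A minimal sketch of Nick's point in this thread: build the array of tuples itself rather than one big formatted string, and the parallelize/DataFrame step becomes direct.

    import random
    import string
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def random_string(n=10):
        return "".join(random.choice(string.ascii_letters) for _ in range(n))

    rows = [(i, random_string()) for i in range(1, 11)]  # real tuples, not text
    df = spark.createDataFrame(rows, ["col1", "col2"])
    df.show()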
