Re: partitioning json data in spark
Well, I could try to do that, but the *partitionBy* method is anyway only supported for the Parquet format, even in Spark 1.5.1.

Narek Galstyan
Նարեկ Գալստյան

On 27 December 2015 at 21:50, Ted Yu wrote:
> Is upgrading to 1.5.x a possibility for you?
>
> Cheers
>
> On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան wrote:
>> http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>> I did try, but it was all in vain. It is also explicitly written in the API docs that it only supports Parquet.
>>
>> Narek Galstyan
>> Նարեկ Գալստյան
>>
>> On 27 December 2015 at 17:52, Igor Berman wrote:
>>> Have you tried to specify the format of your output? Parquet might be the default format:
>>> df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");
>>>
>>> On 27 December 2015 at 15:18, Նարեկ Գալստեան wrote:
>>>> Hey all!
>>>> I want to partition *json* data by a column name and store the result as a collection of json files to be loaded into another database.
>>>> I could use Spark's built-in *partitionBy* function, but it only outputs in Parquet format, which is not desirable for me.
>>>> Could you suggest a way to deal with this problem?
>>>>
>>>> Narek Galstyan
>>>> Նարեկ Գալստյան
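One possible workaround, sketched below in Scala (it assumes a DataFrame df that has already been loaded, e.g. via sqlContext.read.json(...), and uses a hypothetical partition column named "country" and a hypothetical output path — none of these come from the thread): write one JSON directory per distinct value of the column, mimicking the key=value layout that partitionBy produces for Parquet.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: the DataFrame, column name and base path are assumptions.
def writeJsonByColumn(df: DataFrame, partitionCol: String, basePath: String): Unit = {
  // collect the distinct partition values on the driver (assumes their number is modest)
  val values = df.select(partitionCol).distinct().collect().map(_.get(0))
  values.foreach { v =>
    df.filter(df(partitionCol) === v)       // keep only the rows for this partition value
      .write
      .format("json")                       // force JSON output instead of the Parquet default
      .mode(SaveMode.Overwrite)
      .save(s"$basePath/$partitionCol=$v")  // e.g. /output/json/country=AM
  }
}

This trades one full write for one filtered pass per distinct value, so it is mainly attractive when the partition column has relatively few values.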
Re: partitioning json data in spark
http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
I did try, but it was all in vain. It is also explicitly written in the API docs that it only supports Parquet.

Narek Galstyan
Նարեկ Գալստյան

On 27 December 2015 at 17:52, Igor Berman wrote:
> Have you tried to specify the format of your output? Parquet might be the default format:
> df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");
>
> On 27 December 2015 at 15:18, Նարեկ Գալստեան wrote:
>> Hey all!
>> I want to partition *json* data by a column name and store the result as a collection of json files to be loaded into another database.
>> I could use Spark's built-in *partitionBy* function, but it only outputs in Parquet format, which is not desirable for me.
>> Could you suggest a way to deal with this problem?
>>
>> Narek Galstyan
>> Նարեկ Գալստյան
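A side note on the Parquet-only limitation: the linked page is the 1.4.1 documentation, and later DataFrameWriter docs note that partitionBy, initially Parquet-only, covers JSON (along with text, ORC and Avro) from 1.5 onwards. So on a 1.5+ cluster the direct form may be worth trying — a one-line hedged sketch, with the same hypothetical column name and path as above:

df.write.partitionBy("country").format("json").save("/output/json")  // 1.5+, per the later docs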
partitioning json data in spark
Hey all!

I want to partition *json* data by a column name and store the result as a collection of json files to be loaded into another database.

I could use Spark's built-in *partitionBy* function, but it only outputs in Parquet format, which is not desirable for me.

Could you suggest a way to deal with this problem?

Narek Galstyan
Նարեկ Գալստյան
Re: Debug Spark
A question regarding the topic: I am using IntelliJ to write Spark applications and then have to ship the source code to my cluster in the cloud to compile and test. Is there a way to automate the process using IntelliJ?

Narek Galstyan
Նարեկ Գալստյան

On 29 November 2015 at 20:51, Ndjido Ardo BAR wrote:
> Masf, the following link sets out the basics to start debugging your Spark apps in local mode:
>
> https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940
>
> Ardo
>
> On Sun, Nov 29, 2015 at 5:34 PM, Masf wrote:
>> Hi Ardo
>>
>> Is there a tutorial for debugging with IntelliJ?
>>
>> Thanks
>>
>> Regards.
>> Miguel.
>>
>> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR wrote:
>>> hi,
>>>
>>> IntelliJ is just great for that!
>>>
>>> cheers,
>>> Ardo.
>>>
>>> On Sun, Nov 29, 2015 at 5:18 PM, Masf wrote:
>>>> Hi
>>>>
>>>> Is it possible to debug Spark locally with IntelliJ or another IDE?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Regards.
>>>> Miguel Ángel
>>
>> --
>> Regards.
>> Miguel Ángel
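For the local-debugging part of the thread, a minimal sketch (object name and sample path are hypothetical): with the master set to local[*], the whole job runs in a single JVM, so ordinary IntelliJ breakpoints work.

import org.apache.spark.{SparkConf, SparkContext}

object DebugLocally {
  def main(args: Array[String]): Unit = {
    // local[*] keeps driver and executors in one process, which is what makes IDE debugging possible
    val conf = new SparkConf().setAppName("debug-locally").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("src/test/resources/sample.txt")  // hypothetical small sample file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)  // set a breakpoint here and inspect the partial results
    sc.stop()
  }
}

For the cluster round trip asked about above, one common (if unglamorous) pattern, not specific to this thread, is to build a fat jar (e.g. with sbt-assembly or a Maven shade/assembly setup) and submit it with spark-submit; that step can be wired into IntelliJ as an external tool or a "Before launch" task so a single run configuration rebuilds and ships the code.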
Re: get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")
Well, I do not really need to do it while another job is editing them. I just need to get the names of the folders when I read through textFile("path/to/dir/*/*/*.js").

Using *native hadoop* libraries, can I do something like fs.copy("/my/path/*/*", "new/path/")?

Narek Galstyan
Նարեկ Գալստյան

On 27 October 2015 at 19:13, Deenar Toraskar wrote:
> This won't work, as you can never guarantee which files were read by Spark if some other process is writing files to the same location. It would be far less work to move files matching your pattern to a staging location and then load them using sc.textFile. You should find HDFS file system calls that are equivalent to the normal file system ones if command-line tools like distcp or mv don't meet your needs.
>
> On 27 Oct 2015 1:49 p.m., "Նարեկ Գալստեան" wrote:
>> Dear Spark users,
>>
>> I am reading a set of json files to compile them into the Parquet data format. I would like to mark the folders in some way after having read their contents, so that I do not read them again (e.g. I could change the name of the folder).
>>
>> I use the .textFile("path/to/dir/*/*/*.js") technique to *automatically* detect the files. I cannot, however, use the same notation *to rename them*.
>>
>> Could you suggest how I can *get the names of these folders* so that I can rename them using native hadoop libraries?
>>
>> I am using Apache Spark 1.4.1.
>>
>> I look forward to hearing suggestions!
>>
>> yours,
>> Narek
>>
>> Նարեկ Գալստյան
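There is no fs.copy(...) call in the Hadoop FileSystem API as written above, but the same effect can be had with globStatus plus rename (or FileUtil.copy for an actual copy). A hedged sketch in Scala, with hypothetical paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// inside a Spark job, FileSystem.get(sc.hadoopConfiguration) picks up the cluster's fs.defaultFS
val fs = FileSystem.get(new Configuration())

// expand the same glob that textFile uses and keep the parent folder of every matched file
val matched = fs.globStatus(new Path("/path/to/dir/*/*/*.js"))
val dirs = matched.map(_.getPath.getParent).distinct

// mark each matched folder, e.g. by appending a suffix to its name (moving it out of the
// watched tree with the same rename call works too)
dirs.foreach { dir =>
  fs.rename(dir, new Path(dir.getParent, dir.getName + "_processed"))
}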
get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")
Dear Spark users,

I am reading a set of json files to compile them into the Parquet data format. I would like to mark the folders in some way after having read their contents, so that I do not read them again (e.g. I could change the name of the folder).

I use the .textFile("path/to/dir/*/*/*.js") technique to *automatically* detect the files. I cannot, however, use the same notation *to rename them*.

Could you suggest how I can *get the names of these folders* so that I can rename them using native hadoop libraries?

I am using Apache Spark 1.4.1.

I look forward to hearing suggestions!

yours,
Narek

Նարեկ Գալստյան
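Another angle on the same question, sketched below with hypothetical paths: sc.wholeTextFiles returns (path, content) pairs, so the folders that were actually read can be recovered from the RDD itself rather than listed separately. The caveat is that each file is read whole into memory, which only suits reasonably small json files.

import org.apache.hadoop.fs.Path

// assumes an existing SparkContext sc; the glob is the same one used with textFile
val files = sc.wholeTextFiles("path/to/dir/*/*/*.js")

// keys are the full file paths; take each file's parent directory and deduplicate
val readDirs = files.keys
  .map(p => new Path(p).getParent.toString)
  .distinct()
  .collect()

// readDirs can then be renamed with the Hadoop FileSystem API, as in the sketch above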
Interactively search Parquet-stored data using Spark Streaming and DataFrames
I have a significant amount of data stored on my Hadoop HDFS as Parquet files.

I am using Spark Streaming to interactively receive queries from a web server and transform the received queries into SQL to run on my data using Spark SQL. In this process I need to run several SQL queries and then return some aggregate result by merging or subtracting the results of the individual queries.

Are there any ways I could optimize and increase the speed of the process by, for example, running queries on already received DataFrames rather than on the whole database?

Is there a better way to interactively query the Parquet-stored data and return results?

Thank you!

Narek Galstyan
Նարեկ Գալստյան
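One way to act on the "already received DataFrames" idea, sketched below with hypothetical names and paths: read the Parquet data once, cache it, register it as a temporary table, and let every query that arrives through the streaming job run against the cached table instead of re-reading HDFS.

import org.apache.spark.sql.SQLContext

// assumes an existing SparkContext sc
val sqlContext = new SQLContext(sc)

// load the Parquet files once and keep them in memory across queries
val events = sqlContext.read.parquet("hdfs:///data/events")  // hypothetical path
events.cache()
events.registerTempTable("events")                            // hypothetical table name

// for each SQL string received from the web server via the streaming job:
val perQuery = sqlContext.sql("SELECT category, COUNT(*) AS cnt FROM events GROUP BY category")
val rows = perQuery.collect()  // merge or subtract the per-query results on the driver

The first action after cache() pays the load cost; subsequent queries then hit the in-memory columnar data, which is usually where the interactive speed-up comes from.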