Re: partitioning json data in spark

2015-12-28 Thread Նարեկ Գալստեան
Well, I could try to do that,
but the *partitionBy* method is still only supported for the Parquet format,
even in Spark 1.5.1.

Narek

Narek Galstyan

Նարեկ Գալստյան

On 27 December 2015 at 21:50, Ted Yu  wrote:

> Is upgrading to 1.5.x a possibility for you?
>
> Cheers
>
> On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան 
> wrote:
>
>>
>> http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>> I did try, but it was all in vain.
>> The API docs also explicitly state that it only supports Parquet.
>>
>> Narek Galstyan
>>
>> Նարեկ Գալստյան
>>
>> On 27 December 2015 at 17:52, Igor Berman  wrote:
>>
>>> Have you tried specifying the format of your output? Parquet might be
>>> the default format:
>>> df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");
>>>
>>> On 27 December 2015 at 15:18, Նարեկ Գալստեան 
>>> wrote:
>>>
>>>> Hey all!
>>>> I want to partition *json* data by a column name and store the
>>>> result as a collection of json files to be loaded into another database.
>>>>
>>>> I could use Spark's built-in *partitionBy* function, but it only outputs
>>>> in Parquet format, which is not desirable for me.
>>>>
>>>> Could you suggest a way to deal with this problem?
>>>> Narek Galstyan
>>>>
>>>> Նարեկ Գալստյան
>>>>
>>>
>>>
>>
>
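
For Spark 1.4.x, where partitionBy is documented as Parquet-only, one manual
workaround is to split the DataFrame on the distinct values of the partition
column and write each split as JSON into its own subdirectory. A minimal
sketch, assuming a DataFrame df with an illustrative partition column
"country" and an illustrative output path:

    import org.apache.spark.sql.SaveMode

    // Collect the distinct partition values; assumes the column has low cardinality.
    val values = df.select("country").distinct().collect().map(_.getString(0))

    // Write each group as JSON under a "country=<value>" directory,
    // mirroring the layout partitionBy would produce for Parquet.
    values.foreach { v =>
      df.filter(df("country") === v)
        .write.format("json")
        .mode(SaveMode.Overwrite)
        .save(s"/tmp/output/country=$v")
    }

Each pass filters the full dataset, so this costs one scan per distinct
value; caching df first keeps the repeated scans in memory.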


Re: partitioning json data in spark

2015-12-27 Thread Նարեկ Գալստեան
http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
I did try, but it was all in vain.
The API docs also explicitly state that it only supports Parquet.


Narek Galstyan

Նարեկ Գալստյան

On 27 December 2015 at 17:52, Igor Berman  wrote:

> Have you tried specifying the format of your output? Parquet might be
> the default format:
> df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");
>
> On 27 December 2015 at 15:18, Նարեկ Գալստեան  wrote:
>
>> Hey all!
>> I want to partition *json* data by a column name and store the
>> result as a collection of json files to be loaded into another database.
>>
>> I could use Spark's built-in *partitionBy* function, but it only outputs
>> in Parquet format, which is not desirable for me.
>>
>> Could you suggest a way to deal with this problem?
>> Narek Galstyan
>>
>> Նարեկ Գալստյան
>>
>
>


partitioning json data in spark

2015-12-27 Thread Նարեկ Գալստեան
Hey all!
I want to partition *json* data by a column name and store the result
as a collection of json files to be loaded into another database.

I could use Spark's built-in *partitionBy* function, but it only outputs in
Parquet format, which is not desirable for me.

Could you suggest a way to deal with this problem?
Narek Galstyan

Նարեկ Գալստյան


Re: Debug Spark

2015-11-29 Thread Նարեկ Գալստեան
A question regarding the topic:

I am using IntelliJ to write Spark applications and then have to ship the
source code to my cluster in the cloud to compile and test.

Is there a way to automate the process using IntelliJ?

Narek Galstyan

Նարեկ Գալստյան

On 29 November 2015 at 20:51, Ndjido Ardo BAR  wrote:

> Masf, the following link covers the basics of debugging your Spark
> apps in local mode:
>
>
> https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940
>
> Ardo
>
> On Sun, Nov 29, 2015 at 5:34 PM, Masf  wrote:
>
>> Hi Ardo
>>
>>
>> Is there a tutorial on debugging with IntelliJ?
>>
>> Thanks
>>
>> Regards.
>> Miguel.
>>
>>
>> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR 
>> wrote:
>>
>>> hi,
>>>
>>> IntelliJ is just great for that!
>>>
>>> cheers,
>>> Ardo.
>>>
>>> On Sun, Nov 29, 2015 at 5:18 PM, Masf  wrote:
>>>
 Hi

 Is it possible to debug Spark locally with IntelliJ or another IDE?

 Thanks

 --
 Regards.
 Miguel Ángel

>>>
>>>
>>
>>
>> --
>>
>>
>> Regards.
>> Miguel Ángel
>>
>
>
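
For the local-mode debugging discussed above, a minimal sketch (the object
and app names are illustrative) is to set the master to local[*] so the whole
job runs inside the IDE's JVM and ordinary breakpoints fire:

    import org.apache.spark.{SparkConf, SparkContext}

    object DebugLocally {
      def main(args: Array[String]): Unit = {
        // local[*] runs the driver and executors in this single JVM,
        // so IntelliJ breakpoints work without any remote setup.
        val conf = new SparkConf().setAppName("debug-demo").setMaster("local[*]")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).filter(_ % 2 == 0).count()) // breakpoint here
        sc.stop()
      }
    }

For code that must run on the cluster itself, one common approach is to start
the driver JVM with the standard JDWP agent via spark.driver.extraJavaOptions
(e.g. -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005) and
attach IntelliJ's Remote run configuration to that port; whether this is
practical depends on network access to the driver.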


Re: get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")

2015-10-27 Thread Նարեկ Գալստեան
Well, I do not really need to do it while another job is editing them.
I just need to get the names of the folders when I read through
textFile("path/to/dir/*/*/*.js").

Using *native hadoop* libraries, can I do something like
*fs.copy("/my/path/*/*", "new/path/")*?



Narek Galstyan

Նարեկ Գալստյան

On 27 October 2015 at 19:13, Deenar Toraskar 
wrote:

> This won't work, as you can never guarantee which files were read by Spark
> if some other process is writing files to the same location. It would be
> far less work to move the files matching your pattern to a staging location
> and then load them from there using sc.textFile. You should find HDFS file
> system calls that are equivalent to the normal file system ones if
> command-line tools like distcp or mv don't meet your needs.
> On 27 Oct 2015 1:49 p.m., "Նարեկ Գալստեան"  wrote:
>
>> Dear Spark users,
>>
>> I am reading a set of json files to compile them into the Parquet data format.
>> I want to mark the folders in some way after having read their
>> contents so that I do not read them again (e.g. by changing the name of the
>> folder).
>>
>> I use the .textFile("path/to/dir/*/*/*.js") technique to *automatically*
>> detect the files.
>> I cannot, however, use the same notation *to rename them*.
>>
>> Could you suggest how I can *get the names of these folders* so that I can
>> rename them using native hadoop libraries?
>>
>> I am using Apache Spark 1.4.1
>>
>> I look forward to hearing your suggestions!
>>
>> yours,
>>
>> Narek
>>
>> Նարեկ Գալստյան
>>
>
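
On the native Hadoop API point: there is no glob-aware copy, but globStatus
expands the same pattern sc.textFile accepts, and the matched paths can then
be renamed one by one. A minimal sketch, assuming the parent directories of
the matched files are the folders to mark and using an illustrative "_done"
suffix:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Reuse the Hadoop configuration Spark already carries.
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Expand the same glob sc.textFile uses; one FileStatus per matched file.
    val matched = fs.globStatus(new Path("path/to/dir/*/*/*.js"))

    // The distinct parent directories are the folders that were read.
    val dirs = matched.map(_.getPath.getParent).distinct

    // Mark each folder by renaming it.
    dirs.foreach { dir =>
      fs.rename(dir, new Path(dir.getParent, dir.getName + "_done"))
    }

As noted above, this is only safe if no other process is writing to the same
folders between the glob and the rename.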


get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")

2015-10-27 Thread Նարեկ Գալստեան
Dear Spark users,

I am reading a set of json files to compile them into the Parquet data format.
I want to mark the folders in some way after having read their contents so
that I do not read them again (e.g. by changing the name of the folder).

I use the .textFile("path/to/dir/*/*/*.js") technique to *automatically*
detect the files.
I cannot, however, use the same notation *to rename them*.

Could you suggest how I can *get the names of these folders* so that I can
rename them using native hadoop libraries?

I am using Apache Spark 1.4.1

I look forward to hearing your suggestions!

yours,

Narek

Նարեկ Գալստյան


Interactively search Parquet-stored data using Spark Streaming and DataFrames

2015-09-28 Thread Նարեկ Գալստեան
I have a significant amount of data stored on my Hadoop HDFS as Parquet files.
I am using Spark Streaming to interactively receive queries from a web
server and transform the received queries into SQL to run on my data using
Spark SQL.

In this process I need to run several SQL queries and then return an
aggregate result by merging or subtracting the results of the individual
queries.

Are there any ways I could optimize and increase the speed of the process
by, for example, running queries on already-received DataFrames rather than
the whole database?

Is there a better way to interactively query the Parquet-stored data and
return results?

Thank you!



Narek Galstyan

Նարեկ Գալստյան
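
One common way to avoid re-reading the Parquet files for every query is to
load them once, register a temporary table, and cache it, so repeated SQL
runs against the in-memory columnar copy. A minimal sketch against the
Spark 1.x API (the path and table name are illustrative):

    // Load the Parquet data once and expose it to SQL.
    val events = sqlContext.read.parquet("hdfs:///data/events")
    events.registerTempTable("events")

    // Cache the table; it is materialized in memory on first use.
    sqlContext.cacheTable("events")

    // Subsequent interactive queries hit the cached columnar data.
    val perDay = sqlContext.sql("SELECT day, COUNT(*) AS n FROM events GROUP BY day")
    val result = perDay.collect()

Merging or subtracting individual query results can then be expressed with
unionAll and except on the cached DataFrames instead of re-scanning HDFS.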