Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
Sure. You can try PySpark, which is the Python API of Spark.
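In PySpark, the per-record work in the untar approach discussed below would just be an ordinary Python function passed to `rdd.map`/`rdd.foreach`. A minimal sketch of such a mapper, shown standalone — the HDFS paths and the exact `hadoop fs -copyToLocal` + `tar -xf` pipeline are assumptions, not something confirmed in this thread:

```python
import shlex

def build_untar_command(line):
    """Given one input line holding the HDFS path of a tar file, build the
    shell command that would fetch it to the worker's local disk and
    extract it. The copyToLocal + tar pipeline is an assumption."""
    hdfs_path = line.strip()
    local_name = hdfs_path.rsplit("/", 1)[-1]
    return "hadoop fs -copyToLocal {} . && tar -xf {}".format(
        shlex.quote(hdfs_path), shlex.quote(local_name))

# In a PySpark job this would be used roughly as:
#   sc.textFile("<list dir>") \
#     .map(build_untar_command) \
#     .foreach(lambda cmd: subprocess.run(cmd, shell=True, check=True))
```

The function itself is pure string construction, so it is easy to unit-test before wiring it into a job.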
> On May 20, 2016, at 06:20, ayan guha wrote:
> 
> Hi
> 
> Thanks for the input. Would it be possible to write it in Python? I think I
> can use FileUtil.unTar from the Hadoop jar. But can I do it from Python?
> 



Re: Tar File: On Spark

2016-05-19 Thread Ted Yu
See http://memect.co/call-java-from-python-so

You can also use Py4J



Re: Tar File: On Spark

2016-05-19 Thread ayan guha
Hi

Thanks for the input. Would it be possible to write it in Python? I think I
can use FileUtil.unTar from the Hadoop jar. But can I do it from Python?


Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
1. Create a temp dir on HDFS, say “/tmp”.
2. Write a script that creates, in the temp dir, one small file for each tar file. Each file has only one line: the HDFS path of that tar file.
3. Write a Spark application along these lines:

  val rdd = sc.textFile(...)
  rdd.foreach { line =>
    // construct an untar command using the path information in `line`
    // and launch the command
  }



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Tar File: On Spark

2016-05-19 Thread ayan guha
Hi

I have a few tar files in a single HDFS folder. Each tar file contains multiple
files.

tar1:
  - f1.txt
  - f2.txt
tar2:
  - f1.txt
  - f2.txt

(each tar file contains exactly the same number of files, with the same names)

I am trying to find a way (Spark or Pig) to extract them into their own
folders.

f1:
  - tar1_f1.txt
  - tar2_f1.txt
f2:
  - tar1_f2.txt
  - tar2_f2.txt

Any help?



-- 
Best Regards,
Ayan Guha