Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Saliya Ekanayake
I haven't worked with datasets but would this help
https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
?
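
For what it's worth, a rough, untested sketch of one way to approximate this on
the Dataset/DataFrame side (Spark 2.x assumed): repartition by the key column and
then sort within partitions, since Datasets don't accept a custom Partitioner
directly. The column name and input path below are made up.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("RepartitionSortSketch").getOrCreate()

// Placeholder input standing in for myRdd converted to a Dataset/DataFrame with a "key" column.
val myDs = spark.read.parquet("/path/to/input")

val outputPartitionCount = 300
val partitionedSorted = myDs
  .repartition(outputPartitionCount, col("key")) // hash-partition by "key" into 300 partitions
  .sortWithinPartitions(col("key"))              // sort rows inside each partition only

// mapPartitions can then run over partitionedSorted for the per-partition processing.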

On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:

> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key, value pairs. I am looking to convert
> this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>


Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
Thank you, Daniel and Yong!

On Wed, Jan 18, 2017 at 4:56 PM, Daniel Siegmann <
dsiegm...@securityscorecard.io> wrote:

> I am not too familiar with Spark Standalone, so unfortunately I cannot
> give you any definite answer. I do want to clarify something though.
>
> The properties spark.sql.shuffle.partitions and spark.default.parallelism
> affect how your data is split up, which will determine the *total* number
> of tasks, *NOT* the number of tasks being run in parallel. Except of
> course you will never run more tasks in parallel than there are total, so
> if your data is small you might be able to control it via these parameters
> - but that wouldn't typically be how you'd use these parameters.
>
> On YARN, as you noted, there is spark.executor.instances as well as
> spark.executor.cores, and you'd multiply them to determine the maximum
> number of tasks that would run in parallel on your cluster. But there is no
> guarantee the executors would be distributed evenly across nodes.
>
> Unfortunately I'm not familiar with how this works on Spark Standalone.
> Your expectations seem reasonable to me. Sorry I can't be helpful,
> hopefully someone else will be able to explain exactly how this works.
>
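
(For reference, a hedged illustration of how these knobs combine; the numbers are
made up. On YARN, the parallel-task ceiling is roughly executor instances x cores
per executor, while spark.sql.shuffle.partitions only sets the total number of
shuffle tasks. On Standalone, spark.cores.max, or --total-executor-cores, caps the
total cores an application takes.)

# YARN: up to 8 * 4 = 32 tasks in parallel; each DataFrame shuffle still creates 128 tasks in total
spark-submit --master yarn --conf spark.executor.instances=8 --conf spark.executor.cores=4 --conf spark.sql.shuffle.partitions=128 ...

# Standalone: cap the application at 32 cores across the cluster, 4 per executor
spark-submit --master spark://host:7077 --conf spark.cores.max=32 --conf spark.executor.cores=4 ...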



-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
So, I should be using spark.sql.shuffle.partitions to control the
parallelism? Is there a guide on how to tune this?

Thank you,
Saliya
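
(A rough sketch of setting it, assuming Spark 2.x and a SparkSession named spark;
the value 64 is illustrative. A common rule of thumb is a small multiple, roughly
2-3x, of the total executor cores, then adjust by watching task duration and
shuffle spill in the UI.)

// Set for the session (Spark 2.x runtime config):
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Or at submit time:
//   --conf spark.sql.shuffle.partitions=64

// Or rebalance a specific DataFrame/Dataset explicitly (df is a placeholder):
val balanced = df.repartition(64)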

On Wed, Jan 18, 2017 at 2:01 PM, Yong Zhang <java8...@hotmail.com> wrote:

> spark.sql.shuffle.partitions controls not only Spark SQL, but also any
> implementation based on Spark DataFrame.
>
>
> If you are using the "spark.ml" package, then most ML libraries in it are
> based on DataFrame. So you should use "spark.sql.shuffle.partitions"
> instead of "spark.default.parallelism".
>
>
> Yong
>
>
> --
> *From:* Saliya Ekanayake <esal...@gmail.com>
> *Sent:* Wednesday, January 18, 2017 12:33 PM
> *To:* spline_pal...@yahoo.com
> *Cc:* jasbir.s...@accenture.com; User
> *Subject:* Re: Spark #cores
>
> The Spark version I am using is 2.10. The language is Scala. This is
> running in standalone cluster mode.
>
> Each worker is able to use all physical CPU cores in the cluster as is the
> default case.
>
> I was using the following parameters to spark-submit
>
> --conf spark.executor.cores=1 --conf spark.default.parallelism=32
>
> Later, I read that the term "cores" doesn't mean physical CPU cores but
> rather #tasks that an executor can execute.
>
> Anyway, I don't have a clear idea how to set the number of executors per
> physical node. I see there's an option in the Yarn mode, but it's not
> available for standalone cluster mode.
>
> Thank you,
> Saliya
>
> On Wed, Jan 18, 2017 at 12:13 PM, Palash Gupta <spline_pal...@yahoo.com>
> wrote:
>
>> Hi,
>>
>> Can you please share how you are assigning cpu core & tell us spark
>> version and language you are using?
>>
>> //Palash
>>
>> Sent from Yahoo Mail on Android
>> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>>
>> On Wed, 18 Jan, 2017 at 10:16 pm, Saliya Ekanayake
>> <esal...@gmail.com> wrote:
>> Thank you for the quick response. No, this is not Spark SQL. I am
>> running the built-in PageRank.
>>
>> On Wed, Jan 18, 2017 at 10:33 AM, <jasbir.s...@accenture.com> wrote:
>>
>>> Are you talking here of Spark SQL ?
>>>
>>> If yes, spark.sql.shuffle.partitions needs to be changed.
>>>
>>>
>>>
>>> *From:* Saliya Ekanayake [mailto:esal...@gmail.com]
>>> *Sent:* Wednesday, January 18, 2017 8:56 PM
>>> *To:* User <user@spark.apache.org>
>>> *Subject:* Spark #cores
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am running a Spark application setting the number of executor cores 1
>>> and a default parallelism of 32 over 8 physical nodes.
>>>
>>>
>>>
>>> The web UI shows it's running on 200 cores. I can't relate this number
>>> to the parameters I've used. How can I control the parallelism in a more
>>> deterministic way?
>>>
>>>
>>>
>>> Thank you,
>>>
>>> Saliya
>>>
>>>
>>>
>>> --
>>>
>>> Saliya Ekanayake, Ph.D
>>>
>>> Applied Computer Scientist
>>>
>>> Network Dynamics and Simulation Science Laboratory (NDSSL)
>>>
>>> Virginia Tech, Blacksburg
>>>
>>>
>>>
>>> --
>>>
>>> This message is for the designated recipient only and may contain
>>> privileged, proprietary, or otherwise confidential information. If you have
>>> received it in error, please notify the sender immediately and delete the
>>> original. Any other use of the e-mail by you is prohibited. Where allowed
>>> by local law, electronic communications with Accenture and its affiliates,
>>> including e-mail and instant messaging (including content), may be scanned
>>> by our systems for the purposes of information security and assessment of
>>> internal compliance with Accenture policy.
>>>
>>> www.accenture.com
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake, Ph.D
>> Applied Computer Scientist
>> Network Dynamics and Simulation Science Laboratory (NDSSL)
>> Virginia Tech, Blacksburg
>>
>>
>
>
> --
> Saliya Ekanayake, Ph.D
> Applied Computer Scientist
> Network Dynamics and Simulation Science Laboratory (NDSSL)
> Virginia Tech, Blacksburg
>
>


-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
The Spark version I am using is 2.10. The language is Scala. This is
running in standalone cluster mode.

Each worker is able to use all physical CPU cores in the cluster as is the
default case.

I was using the following parameters to spark-submit

--conf spark.executor.cores=1 --conf spark.default.parallelism=32

Later, I read that the term "cores" doesn't mean physical CPU cores but
rather #tasks that an executor can execute.

Anyway, I don't have a clear idea how to set the number of executors per
physical node. I see there's an option in the Yarn mode, but it's not
available for standalone cluster mode.

Thank you,
Saliya

On Wed, Jan 18, 2017 at 12:13 PM, Palash Gupta <spline_pal...@yahoo.com>
wrote:

> Hi,
>
> Can you please share how you are assigning cpu core & tell us spark
> version and language you are using?
>
> //Palash
>
> Sent from Yahoo Mail on Android
> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>
> On Wed, 18 Jan, 2017 at 10:16 pm, Saliya Ekanayake
> <esal...@gmail.com> wrote:
> Thank you for the quick response. No, this is not Spark SQL. I am running
> the built-in PageRank.
>
> On Wed, Jan 18, 2017 at 10:33 AM, <jasbir.s...@accenture.com> wrote:
>
>> Are you talking here of Spark SQL ?
>>
>> If yes, spark.sql.shuffle.partitions needs to be changed.
>>
>>
>>
>> *From:* Saliya Ekanayake [mailto:esal...@gmail.com]
>> *Sent:* Wednesday, January 18, 2017 8:56 PM
>> *To:* User <user@spark.apache.org>
>> *Subject:* Spark #cores
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am running a Spark application setting the number of executor cores 1
>> and a default parallelism of 32 over 8 physical nodes.
>>
>>
>>
>> The web UI shows it's running on 200 cores. I can't relate this number to
>> the parameters I've used. How can I control the parallelism in a more
>> deterministic way?
>>
>>
>>
>> Thank you,
>>
>> Saliya
>>
>>
>>
>> --
>>
>> Saliya Ekanayake, Ph.D
>>
>> Applied Computer Scientist
>>
>> Network Dynamics and Simulation Science Laboratory (NDSSL)
>>
>> Virginia Tech, Blacksburg
>>
>>
>>
>> --
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you have
>> received it in error, please notify the sender immediately and delete the
>> original. Any other use of the e-mail by you is prohibited. Where allowed
>> by local law, electronic communications with Accenture and its affiliates,
>> including e-mail and instant messaging (including content), may be scanned
>> by our systems for the purposes of information security and assessment of
>> internal compliance with Accenture policy.
>>
>> www.accenture.com
>>
>
>
>
> --
> Saliya Ekanayake, Ph.D
> Applied Computer Scientist
> Network Dynamics and Simulation Science Laboratory (NDSSL)
> Virginia Tech, Blacksburg
>
>


-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
Thank you for the quick response. No, this is not Spark SQL. I am running
the built-in PageRank.

On Wed, Jan 18, 2017 at 10:33 AM, <jasbir.s...@accenture.com> wrote:

> Are you talking here of Spark SQL ?
>
> If yes, spark.sql.shuffle.partitions needs to be changed.
>
>
>
> *From:* Saliya Ekanayake [mailto:esal...@gmail.com]
> *Sent:* Wednesday, January 18, 2017 8:56 PM
> *To:* User <user@spark.apache.org>
> *Subject:* Spark #cores
>
>
>
> Hi,
>
>
>
> I am running a Spark application setting the number of executor cores 1
> and a default parallelism of 32 over 8 physical nodes.
>
>
>
> The web UI shows it's running on 200 cores. I can't relate this number to
> the parameters I've used. How can I control the parallelism in a more
> deterministic way?
>
>
>
> Thank you,
>
> Saliya
>
>
>
> --
>
> Saliya Ekanayake, Ph.D
>
> Applied Computer Scientist
>
> Network Dynamics and Simulation Science Laboratory (NDSSL)
>
> Virginia Tech, Blacksburg
>
>
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
> 
>
> www.accenture.com
>



-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Spark #cores

2017-01-18 Thread Saliya Ekanayake
Hi,

I am running a Spark application setting the number of executor cores 1 and
a default parallelism of 32 over 8 physical nodes.

The web UI shows it's running on 200 cores. I can't relate this number to
the parameters I've used. How can I control the parallelism in a more
deterministic way?

Thank you,
Saliya

-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Re: Pregel Question

2016-11-22 Thread Saliya Ekanayake
Just realized the attached file has the text formatting wrong. The GitHub link
to the file is
https://github.com/esaliya/graphxprimer/blob/master/src/main/scala-2.10/org/saliya/graphxprimer/PregelExample2.scala

On Tue, Nov 22, 2016 at 3:08 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Hi,
>
> I've created a graph with vertex data of the form (Int, Array[Int]). I am
> using the pregel operator to update values of the array for each vertex.
>
> So my vprog has the following signature. Note that the message is a map of
> arrays from neighbors:
>
> def vprog(vertexId: VertexId, value: (Int, Array[Int]), message:
> scala.collection.mutable.HashMap[Int, Array[Int]])
>
> The full program is attached here. The expectation is for vprog() to update
> the array and then sendMsg() to send the updates to the neighbors.
>
> However, this requires cloning the vertex every time in the vprog()
> function. If I don't clone, Spark would send the same array that it got
> after the initial call.
>
> Is there a way to turn off this caching effect?
>
> Thank you,
> Saliya
>
>
> --
> Saliya Ekanayake, Ph.D
> Applied Computer Scientist
> Network Dynamics and Simulation Science Laboratory (NDSSL)
> Virginia Tech, Blacksburg
>
>


-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Pregel Question

2016-11-22 Thread Saliya Ekanayake
Hi,

I've created a graph with vertex data of the form (Int, Array[Int]). I am
using the pregel operator to update values of the array for each vertex.

So my vprog has the following signature. Note that the message is a map of
arrays from neighbors:

def vprog(vertexId: VertexId, value: (Int, Array[Int]), message:
scala.collection.mutable.HashMap[Int, Array[Int]])

The full program is attached here. The expectation is for vprog() to update the
array and then sendMsg() to send the updates to the neighbors.

However, this requires cloning the vertex every time in the vprog()
function. If I don't clone, Spark would send the same array that it got
after the initial call.

Is there a way to turn off this caching effect?

Thank you,
Saliya
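
(For context, a minimal sketch of the defensive copy inside vprog(), assuming the
(Int, Array[Int]) vertex type above; the combine rule below is a placeholder and
is not taken from the attached program.)

import org.apache.spark.graphx.VertexId
import scala.collection.mutable

def vprog(vertexId: VertexId,
          value: (Int, Array[Int]),
          message: mutable.HashMap[Int, Array[Int]]): (Int, Array[Int]) = {
  // Clone before mutating: otherwise GraphX may keep handing back the same cached array.
  val updated = value._2.clone()
  for ((_, arr) <- message; i <- arr.indices if i < updated.length) {
    updated(i) = math.max(updated(i), arr(i)) // placeholder combine rule
  }
  (value._1, updated)
}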


-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg



GraphX updating vertex property

2016-11-15 Thread Saliya Ekanayake
Hi,

I have created a property graph using GraphX. Each vertex has an integer
array as a property. I'd like to update the values of these arrays without
creating new graph objects.

Is this possible in Spark?

Thank you,
Saliya
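
(For context, a hedged sketch: GraphX graphs are immutable, so the usual pattern
is to derive a new graph with mapVertices or joinVertices, which shares much of
the old graph's structure rather than copying it. The update below is purely
illustrative.)

import org.apache.spark.graphx.{Graph, VertexId}

// graph is assumed to be Graph[Array[Int], ED] for some edge attribute type ED.
def incrementAll[ED](graph: Graph[Array[Int], ED]): Graph[Array[Int], ED] =
  graph.mapVertices { (_: VertexId, arr: Array[Int]) =>
    arr.map(_ + 1) // illustrative update; builds a new array per vertex
  }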

-- 
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg


Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Thank you, I'll try.

saliya

On Wed, Sep 14, 2016 at 12:07 AM, ayan guha <guha.a...@gmail.com> wrote:

> Depends on the join, but unless you are doing a cross join, it should not blow
> up. 6M is not too much. I think what you may want to consider is (a) the volume
> of your data files and (b) reducing shuffling by following similar partitioning
> on both RDDs.
>
> On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> Thank you, but isn't that join going to be too expensive for this?
>>
>> On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> My suggestion:
>>>
>>> 1. Read first text file in (say) RDD1 using textFile
>>> 2. Read 80K data files in RDD2 using wholeTextFiles. RDD2 will be of
>>> signature (filename,filecontent).
>>> 3. Join RDD1 and 2 based on some file name (or some other key).
>>>
>>> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> 1.) What needs to be parallelized is the work for each of those 6M
>>>> rows, not the 80K files. Let me elaborate this with a simple for loop if we
>>>> were to write this serially.
>>>>
>>>> For each line L out of 6M in the first file{
>>>>  process the file corresponding to L out of those 80K files.
>>>> }
>>>>
>>>> The 80K files are in HDFS and to read all that content into each worker
>>>> is not possible due to size.
>>>>
>>>> 2. No. Multiple rows may point to the same file but they operate on
>>>> different records within the file.
>>>>
>>>> 3. End goal is to write back 6M processed information.
>>>>
>>>> This is a simple map-only type scenario. One workaround I can think of is
>>>> to append all the 6M records to each of the data files.
>>>>
>>>> Thank you
>>>>
>>>> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com>
>>>> wrote:
>>>>
>>>>> Question:
>>>>>
>>>>> 1. Why can you not read all 80K files together? i.e., why do you have a
>>>>> dependency on the first text file?
>>>>> 2. Your first text file has 6M rows, but the total number of files is ~80K. Is
>>>>> there a scenario where there may not be a file in HDFS corresponding to
>>>>> the row in the first text file?
>>>>> 3. Maybe as a follow-up to 1: what is your end goal?
>>>>>
>>>>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The first text file is not that large, it has 6 million records
>>>>>> (lines). For each line I need to read a file out of 80K files. They total
>>>>>> around 1.5TB. I didn't understand what you meant by "then again read
>>>>>> text files for each line and union all rdds."
>>>>>>
>>>>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>>>>> raghavendra.pan...@gmail.com> wrote:
>>>>>>
>>>>>>> How large is your first text file? The idea is you read first text
>>>>>>> file and if it is not large you can collect all the lines on driver and
>>>>>>> then again read text files for each line and union all rdds.
>>>>>>>
>>>>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Just wonder if this is possible with Spark?
>>>>>>>>
>>>>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <
>>>>>>>> esal...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've got a text file where each line is a record. For each record,
>>>>>>>>> I need to process a file in HDFS.
>>>>>>>>>
>>>>>>>>> So if I represent these records as an RDD and invoke a map()
>>>>>>>>> operation on them how can I access the HDFS within that map()? Do I 
>>>>>>>>> have to
>>>>>>>>> create a Spark context within map() or is there a better solution to 
>>>>>

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Thank you, but isn't that join going to be too expensive for this?
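
(For reference, a hedged sketch of the join approach suggested below, using
textFile + wholeTextFiles + join; the paths and the assumption that each index
line starts with a file name followed by a tab are made up.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinSketch").getOrCreate()
val sc = spark.sparkContext

// RDD1: (fileName, record) pairs parsed from the 6M-line index file (format assumed).
val rdd1 = sc.textFile("hdfs:///path/to/index.txt")
  .map { line => val parts = line.split("\t", 2); (parts(0), parts(1)) }

// RDD2: (fileName, fileContent) for the ~80K data files, keyed by bare file name.
val rdd2 = sc.wholeTextFiles("hdfs:///path/to/datafiles/*")
  .map { case (path, content) => (path.split("/").last, content) }

// Join so each record sees the content of the file it refers to.
val joined = rdd1.join(rdd2)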

On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:

> My suggestion:
>
> 1. Read first text file in (say) RDD1 using textFile
> 2. Read 80K data files in RDD2 using wholeTextFiles. RDD2 will be of
> signature (filename,filecontent).
> 3. Join RDD1 and 2 based on some file name (or some other key).
>
> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> 1.) What needs to be parallelized is the work for each of those 6M rows,
>> not the 80K files. Let me elaborate this with a simple for loop if we were
>> to write this serially.
>>
>> For each line L out of 6M in the first file{
>>  process the file corresponding to L out of those 80K files.
>> }
>>
>> The 80K files are in HDFS and to read all that content into each worker
>> is not possible due to size.
>>
>> 2. No. Multiple rows may point to the same file but they operate on
>> different records within the file.
>>
>> 3. End goal is to write back 6M processed information.
>>
>> This is a simple map-only type scenario. One workaround I can think of is
>> to append all the 6M records to each of the data files.
>>
>> Thank you
>>
>> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Question:
>>>
>>> 1. Why can you not read all 80K files together? i.e., why do you have a
>>> dependency on the first text file?
>>> 2. Your first text file has 6M rows, but the total number of files is ~80K. Is
>>> there a scenario where there may not be a file in HDFS corresponding to the
>>> row in the first text file?
>>> 3. Maybe as a follow-up to 1: what is your end goal?
>>>
>>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> The first text file is not that large, it has 6 million records
>>>> (lines). For each line I need to read a file out of 80K files. They total
>>>> around 1.5TB. I didn't understand what you meant by "then again read
>>>> text files for each line and union all rdds."
>>>>
>>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>>> raghavendra.pan...@gmail.com> wrote:
>>>>
>>>>> How large is your first text file? The idea is you read first text
>>>>> file and if it is not large you can collect all the lines on driver and
>>>>> then again read text files for each line and union all rdds.
>>>>>
>>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just wonder if this is possible with Spark?
>>>>>>
>>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've got a text file where each line is a record. For each record, I
>>>>>>> need to process a file in HDFS.
>>>>>>>
>>>>>>> So if I represent these records as an RDD and invoke a map()
>>>>>>> operation on them how can I access the HDFS within that map()? Do I 
>>>>>>> have to
>>>>>>> create a Spark context within map() or is there a better solution to 
>>>>>>> that?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Saliya
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Saliya Ekanayake
>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>> Indiana University, Bloomington
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Saliya Ekanayake
>>>>>> Ph.D. Candidate | Research Assistant
>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>> Indiana University, Bloomington
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
1.) What needs to be parallelized is the work for each of those 6M rows,
not the 80K files. Let me elaborate this with a simple for loop if we were
to write this serially.

For each line L out of 6M in the first file{
 process the file corresponding to L out of those 80K files.
}

The 80K files are in HDFS and to read all that content into each worker is
not possible due to size.

2. No. Multiple rows may point to the same file but they operate on
different records within the file.

3. End goal is to write back 6M processed information.

This is a simple map-only type scenario. One workaround I can think of is to
append all the 6M records to each of the data files.

Thank you

On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:

> Question:
>
> 1. Why can you not read all 80K files together? i.e., why do you have a
> dependency on the first text file?
> 2. Your first text file has 6M rows, but the total number of files is ~80K. Is
> there a scenario where there may not be a file in HDFS corresponding to the
> row in the first text file?
> 3. Maybe as a follow-up to 1: what is your end goal?
>
> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> The first text file is not that large, it has 6 million records (lines).
>> For each line I need to read a file out of 80K files. They total around
>> 1.5TB. I didn't understand what you meant by "then again read text files
>> for each line and union all rdds."
>>
>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>> raghavendra.pan...@gmail.com> wrote:
>>
>>> How large is your first text file? The idea is you read first text file
>>> and if it is not large you can collect all the lines on driver and then
>>> again read text files for each line and union all rdds.
>>>
>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote:
>>>
>>>> Just wonder if this is possible with Spark?
>>>>
>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've got a text file where each line is a record. For each record, I
>>>>> need to process a file in HDFS.
>>>>>
>>>>> So if I represent these records as an RDD and invoke a map() operation
>>>>> on them how can I access the HDFS within that map()? Do I have to create a
>>>>> Spark context within map() or is there a better solution to that?
>>>>>
>>>>> Thank you,
>>>>> Saliya
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>>
>>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
The first text file is not that large, it has 6 million records (lines).
For each line I need to read a file out of 80K files. They total around
1.5TB. I didn't understand what you meant by "then again read text files
for each line and union all rdds."

On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> How large is your first text file? The idea is you read first text file
> and if it is not large you can collect all the lines on driver and then
> again read text files for each line and union all rdds.
>
> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote:
>
>> Just wonder if this is possible with Spark?
>>
>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've got a text file where each line is a record. For each record, I
>>> need to process a file in HDFS.
>>>
>>> So if I represent these records as an RDD and invoke a map() operation
>>> on them how can I access the HDFS within that map()? Do I have to create a
>>> Spark context within map() or is there a better solution to that?
>>>
>>> Thank you,
>>> Saliya
>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Just wonder if this is possible with Spark?

On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com>
wrote:

> Hi,
>
> I've got a text file where each line is a record. For each record, I need
> to process a file in HDFS.
>
> So if I represent these records as an RDD and invoke a map() operation on
> them how can I access the HDFS within that map()? Do I have to create a
> Spark context within map() or is there a better solution to that?
>
> Thank you,
> Saliya
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


Access HDFS within Spark Map Operation

2016-09-11 Thread Saliya Ekanayake
Hi,

I've got a text file where each line is a record. For each record, I need
to process a file in HDFS.

So if I represent these records as an RDD and invoke a map() operation on
them how can I access the HDFS within that map()? Do I have to create a
Spark context within map() or is there a better solution to that?

Thank you,
Saliya
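
(For context, a hedged sketch of one common pattern: no SparkContext inside
map(); instead open HDFS files directly with the Hadoop FileSystem API from
within mapPartitions. The paths, record layout, and per-record work below are
made up.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val sc = new SparkContext(new SparkConf().setAppName("HdfsInMapSketch"))

// Each line of the index file is assumed to start with the HDFS path of the file to process.
val records = sc.textFile("hdfs:///path/to/index.txt")

val processed = records.mapPartitions { lines =>
  // One FileSystem handle per partition, created on the executor; no SparkContext needed here.
  val fs = FileSystem.get(new Configuration())
  lines.map { line =>
    val filePath = line.split("\t")(0)      // assumed layout: path<TAB>rest-of-record
    val in = fs.open(new Path(filePath))
    try {
      // Placeholder per-record work: read the first line of the referenced file.
      val firstLine = Source.fromInputStream(in).getLines().take(1).mkString
      s"$line\t${firstLine.length}"
    } finally {
      in.close()
    }
  }
}

processed.saveAsTextFile("hdfs:///path/to/output")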



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington