Yep, I have several tens of terabytes of data that will easily be over a
couple of hundred TB in a year. And it isn't as if I have only one or two
use cases to run on these data sets: I need everything from simple
aggregations like counting and averaging to more advanced analytics. I
also need to be able to search through these data sets by keyword.
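
(To make the "simple aggregation" part concrete, here is a minimal sketch
of the averaging case as a plain MR reducer; the class name, key layout
and value types are illustrative assumptions, not from a real job:)

    // Minimal averaging reducer sketch (new MapReduce API). Assumes the
    // mapper has already emitted (groupKey, numericValue) pairs.
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      @Override
      protected void reduce(Text key, Iterable<DoubleWritable> values,
          Context ctx) throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable v : values) {
          sum += v.get();
          count++;
        }
        ctx.write(key, new DoubleWritable(sum / count)); // mean per key
      }
    }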

The answer right now seems to be: duplicate your data based on your
problem. If you are searching, take your data on HDFS and turn it into a
Lucene index with Solr or Elasticsearch. Some solutions favour
representations like HBase.

On Mon, Oct 14, 2013 at 8:17 PM, Lance Norskog <goks...@gmail.com> wrote:

>  There are a few reasons to use map/reduce, or just map-only or
> reduce-only jobs.
> 1) You want to run parallel algorithms where data from multiple machines
> have to be cross-checked. Map/Reduce allows this.
> 2) You want to run several instances of a job. Hadoop does this reliably
> by monitoring all instances, restarting failed ones, etc.
> 3) You have way too much data to fit on one computer. Same as #2.
>
> You might not need Hadoop if you can run your programs without it.
>
> Lance
>
>
> On 10/14/2013 08:02 PM, Xuri Nagarin wrote:
>
> Yes, I tested with smaller data sets and the MR job correctly
> reads/matches one line at a time.
>
>
>
>
> On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX <dsui...@rdx.com> wrote:
>
>> So, perhaps this has been thought of, but perhaps not.
>>
>>  It is my understanding that grep usually processes input one line at
>> a time. As I am currently experimenting with Avro, I am finding that the
>> local grep utility does not handle it well at all: an Avro file is
>> essentially one long line, so working from local Avro, grep does poorly
>> at pattern matching and just returns the whole file as a match. It also
>> takes a long time to view in the vi editor, since there are no EOL
>> markers.
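>>
>>  (For what it's worth, Avro container files are meant to be read
>> record-by-record with an Avro reader rather than with line tools; a
>> minimal local-file sketch, assuming a generic schema:)
>>
>>     import java.io.File;
>>     import org.apache.avro.file.DataFileReader;
>>     import org.apache.avro.generic.GenericDatumReader;
>>     import org.apache.avro.generic.GenericRecord;
>>
>>     public class AvroCat {
>>       public static void main(String[] args) throws Exception {
>>         // Iterate records instead of "lines"; the schema travels with
>>         // the file, so no EOL markers are needed.
>>         try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
>>             new File(args[0]), new GenericDatumReader<GenericRecord>())) {
>>           for (GenericRecord rec : reader) {
>>             System.out.println(rec); // one record per output line
>>           }
>>         }
>>       }
>>     }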
>>
>>  If you have modified it to read sequence files, does the sequence file
>> you are reading contain newline characters? If not, perhaps the file is
>> being read as one whole line, causing some unexpected effects.
>>
>>  Thanks,
>>  *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>>  100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <secs...@gmail.com> wrote:
>>
>>> On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota
>>> <pradeep...@gmail.com> wrote:
>>>
>>>> I don't think it necessarily means that the job is a bad candidate for
>>>> MR. It's a different type of a workload. Hortonworks has a great article on
>>>> the different types of workloads you might see and how that affects your
>>>> provisioning choices at
>>>> http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
>>>>
>>>
>>>  One statement that stood out to me in the link above is "For these
>>> reasons, Hortonworks recommends that you either use the Balanced workload
>>> configuration or invest in a pilot Hadoop cluster and plan to evolve as you
>>> analyze the workload patterns in your environment."
>>>
>>>  Now, this is not a critique/concern of HW but rather of Hadoop. Well,
>>> what if my workloads can be both CPU- and IO-intensive? Do I take the
>>> approach of throw-enough-excess-hardware-just-in-case?
>>>
>>>
>>>>
>>>>  I have not looked at the Grep code, so I'm not sure why it's behaving
>>>> the way it is. It's still curious that streaming has higher IO
>>>> throughput and lower CPU usage. It may have to do with the fact that
>>>> /bin/grep is a native implementation while Grep (Hadoop) is probably
>>>> using the Java Pattern/Matcher API.
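>>>>
>>>>  (Roughly what that looks like on the Java side; a from-memory sketch
>>>> in the spirit of the bundled RegexMapper, not the exact code, and
>>>> "grep.pattern" is an assumed config key:)
>>>>
>>>>     import java.io.IOException;
>>>>     import java.util.regex.Matcher;
>>>>     import java.util.regex.Pattern;
>>>>     import org.apache.hadoop.io.LongWritable;
>>>>     import org.apache.hadoop.io.Text;
>>>>     import org.apache.hadoop.mapreduce.Mapper;
>>>>
>>>>     public class GrepMapper
>>>>         extends Mapper<Object, Text, Text, LongWritable> {
>>>>       private static final LongWritable ONE = new LongWritable(1);
>>>>       private Pattern pattern;
>>>>
>>>>       @Override
>>>>       protected void setup(Context ctx) {
>>>>         // Compile once per task, not once per record.
>>>>         pattern = Pattern.compile(
>>>>             ctx.getConfiguration().get("grep.pattern"));
>>>>       }
>>>>
>>>>       @Override
>>>>       protected void map(Object key, Text value, Context ctx)
>>>>           throws IOException, InterruptedException {
>>>>         Matcher m = pattern.matcher(value.toString());
>>>>         while (m.find()) {
>>>>           ctx.write(new Text(m.group()), ONE); // emit each match
>>>>         }
>>>>       }
>>>>     }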
>>>>
>>>
>>>  The Grep code is from the bundled examples in CDH. I made a one-line
>>> modification for it to read sequence files. The streaming job probably
>>> does not have lower CPU utilization overall, but I do see that it evens
>>> out the CPU utilization across all the available processors. I guess the
>>> native grep binary handles threading better than the Java MR job?
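>>>
>>>  (For anyone reproducing this: from memory, the one-line change was in
>>> the job driver, something like the following; the class and job names
>>> here are illustrative:)
>>>
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.hadoop.mapreduce.Job;
>>>     import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
>>>
>>>     public class GrepDriver {
>>>       public static void main(String[] args) throws Exception {
>>>         Job job = Job.getInstance(new Configuration(), "grep");
>>>         // The one-line change: read SequenceFiles instead of text, so
>>>         // each map() call receives a record value, not a raw text line.
>>>         job.setInputFormatClass(SequenceFileInputFormat.class);
>>>         // ... the usual driver setup (mapper, paths, etc.) follows ...
>>>       }
>>>     }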
>>>
>>>  Which brings me to ask: if you have the mapper/reducer functionality
>>> built into a platform-specific binary, won't it always be more efficient
>>> than a Java MR job? And, in such cases, am I better off with streaming
>>> than with Java MR?
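>>>
>>>  (For reference, the streaming run was along these lines; the jar path,
>>> input/output paths, and PATTERN below are placeholders, adjust for your
>>> install:)
>>>
>>>     # grep exits non-zero when it finds no matches, which streaming
>>>     # treats as a task failure unless told otherwise.
>>>     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
>>>         -D mapred.reduce.tasks=0 \
>>>         -D stream.non.zero.exit.is.failure=false \
>>>         -input /data/in \
>>>         -output /tmp/grep-out \
>>>         -mapper "/bin/grep -e PATTERN"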
>>>
>>>  Thanks for your responses.
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <secs...@gmail.com> wrote:
>>>>
>>>>> Thanks Pradeep. Does it mean this job is a bad candidate for MR?
>>>>>
>>>>>  Interestingly, running the cmdline '/bin/grep' under a streaming job
>>>>> gives (1) much better disk throughput and (2) CPU load that is spread
>>>>> almost evenly across all cores/threads (no CPU gets pegged at 100%).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota
>>>>> <pradeep...@gmail.com> wrote:
>>>>>
>>>>>> Actually... I believe that is expected behavior. Since your CPU is
>>>>>> pegged at 100%, you're not going to be IO bound. Jobs typically tend
>>>>>> to be either CPU bound or IO bound: if you're CPU bound, you expect
>>>>>> to see low IO throughput; if you're IO bound, you expect to see low
>>>>>> CPU usage.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <secs...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>  I have a simple Grep job (from the bundled examples) that I am
>>>>>>> running on an 11-node cluster. Each node has 2x8-core Intel Xeons
>>>>>>> (shows 32 CPUs with HT on), 64GB RAM, and 8 x 1TB disks. I have
>>>>>>> mappers set to 20 per node.
>>>>>>>
>>>>>>>  When I run the Grep job, I notice that CPU gets pegged at 100% on
>>>>>>> multiple cores, but disk throughput remains a dismal 1-2 Mbytes/sec
>>>>>>> on a single disk on each node. So I guess the cluster is performing
>>>>>>> poorly in terms of disk IO. Running Terasort, I see each disk put
>>>>>>> out 25-35 Mbytes/sec, with total cluster throughput above 1.5
>>>>>>> Gbytes/sec.
>>>>>>>
>>>>>>>  How do I go about re-configuring or re-writing the job to utilize
>>>>>>> maximum disk IO?
>>>>>>>
>>>>>>>  TIA,
>>>>>>>
>>>>>>>  Xuri
