Re: YARN creates only 1 container

2014-05-27 Thread Pradeep Gollakota
I believe it's behaving as expected. It will spawn 64 containers because that's how much memory you have available. The vcores aren't strictly enforced since CPUs can be elastic. This blog post from Cloudera explains how to enforce CPU limits using cgroups.

Re: Reading sequence file in pig

2014-05-20 Thread Pradeep Gollakota
You can use the SequenceFileLoader from the piggybank. http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html On Tue, May 20, 2014 at 2:46 AM, abhishek dodda abhishek.dod...@gmail.comwrote: Hi All, I have trouble building code for this project.
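A minimal sketch of what that looks like in a script (the jar path and the key/value types are placeholders, not from the original thread):

  REGISTER /path/to/piggybank.jar;
  A = LOAD 'data.seq'
      USING org.apache.pig.piggybank.storage.SequenceFileLoader()
      AS (key: chararray, value: chararray);
  DUMP A;

Note that the piggybank loader only understands the standard Writable types; custom Writables need extra conversion (see the elephant-bird discussion further down this page).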

Re: Reading sequence file in pig

2014-05-20 Thread Pradeep Gollakota
) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) On Tue, May 20, 2014 at 5:41 AM, Pradeep Gollakota pradeep...@gmail.comwrote: You can use

Unable to find final method in superclass

2014-05-19 Thread Pradeep Gollakota
Hi All, I’m trying to work with NIO in Java 7, and I’m not able to access methods that are declared in the super class. (.getPath (java.nio.file.FileSystems/getDefault) /) The above code throws the following Exception: Exception in thread main java.lang.IllegalArgumentException: No matching

Re: Query : Filtering out string from a field

2014-05-12 Thread Pradeep Gollakota
Check out http://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html#REGEX_EXTRACT This may suit your needs On Mon, May 12, 2014 at 12:16 AM, kartik manocha koolkarti...@gmail.comwrote: Hi, I am new to pig facing an issue in filtering out a string from a field, mentioned is the scenario.
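For reference, a small REGEX_EXTRACT sketch (the field name and pattern are made up, not the data from the question):

  A = LOAD 'input' AS (line: chararray);
  -- group 1 of the pattern: everything before the first colon
  B = FOREACH A GENERATE REGEX_EXTRACT(line, '([^:]+):.*', 1) AS prefix;
  DUMP B;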

Re: Query : Filtering out string from a field

2014-05-12 Thread Pradeep Gollakota
regex, so that it prints the string before that. Thanks, Kartik On Mon, May 12, 2014 at 2:03 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Check out http://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html#REGEX_EXTRACT This may suit your needs On Mon, May 12, 2014 at 12

Re: ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Number

2014-04-24 Thread Pradeep Gollakota
Whats the LoadFunc you're using? On Thu, Apr 24, 2014 at 9:28 AM, Swapnil Shinde swapnilushi...@gmail.comwrote: I am facing very weird problem while multiplication. Pig simplified code snippet- A = LOAD 'file_A' AS (colA1 : double, colA2 : double); describe A; *A: {colA1:

Re: ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Number

2014-04-24 Thread Pradeep Gollakota
might not be cast-able to numeric for one or more records. On 24 April 2014 22:24, Pradeep Gollakota pradeep...@gmail.com wrote: Whats the LoadFunc you're using? On Thu, Apr 24, 2014 at 9:28 AM, Swapnil Shinde swapnilushi...@gmail.com wrote: I am facing very weird
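When the loader hands back raw bytearrays, an explicit cast in the script is the usual workaround; whether it succeeds depends on the LoadFunc's caster. A sketch using the column names from the question:

  A = LOAD 'file_A' AS (colA1, colA2);                 -- untyped fields arrive as bytearray
  B = FOREACH A GENERATE (double)colA1 * (double)colA2 AS product;

Records whose bytes can't be interpreted as a number typically come out as null with a warning rather than crashing the job.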

Re: Number of map task

2014-04-22 Thread Pradeep Gollakota
Pig is a little too smart when dealing with data. It has a feature called split combination. If you disable it, you should see more mappers: SET pig.noSplitCombination true; On Tue, Apr 22, 2014 at 12:14 PM, Patcharee Thongtra patcharee.thong...@uni.no wrote: Hi, I wrote a custom

Re: Strange CROSS behavior

2014-04-18 Thread Pradeep Gollakota
What is the storage func you're using? My guess is that there is some shared state in the Storage func. Take a look at this SO post that deals with shared state in Stores: http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592. The reason why

Re: Pig script : Need help

2014-04-07 Thread Pradeep Gollakota
That is because you're calling REPLACE on a bag of tuples and not a string. What you would want to do is write a UDF (suggested name JOIN_ON), that takes as an argument a join char and will join all the tuples in the bag by the join char. On Mon, Apr 7, 2014 at 12:31 PM, Krishnan Narayanan

Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
I don't understand what you're trying to do from your example. If you perform a cross on the data you have, the output will be the following: (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11)

Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
Subject: Re: Any way to join two aliases without using CROSS The output I would like to see is (1,2,3,4,5,10,11) (1,2,4,5,7,10,12) (1,5,7,8,9,10,13) On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com wrote: I don't understand what you're trying to do from your

Re: 回复:Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
Unfortunately, the Enumerate UDF from DataFu would not work in this case. The UDF works on Bags and in this case, we want to enumerate a relation. Implementing RANK is a very tricky thing to do correctly. I'm not even sure if it's doable just by using Pig operators, UDFs or macros. Best option is

Re: Unable to add file paths when registering a UDF

2014-03-12 Thread Pradeep Gollakota
According to the docs, it should work. http://pig.apache.org/docs/r0.12.0/basic.html#register Stupid question, but is the path correct? Is it on HDFS or local disk? On Tue, Mar 11, 2014 at 8:36 PM, Anthony Alleven aalle...@iastate.eduwrote: Hello, I am trying to use a User Defined Function

Re: Remote Zookeeper

2014-03-11 Thread Pradeep Gollakota
Is there a firewall thats blocking connections on port 9092? Also, the broker list should be comma separated. On Tue, Mar 11, 2014 at 9:02 AM, A A andthereitg...@hotmail.com wrote: Sorry one of the brokers for was down. Brought it back up. Tried the following

Re: one MR job for group-bys and cube-bys

2014-03-11 Thread Pradeep Gollakota
Best way to examine this is to use the EXPLAIN operator. It will show you the physical MapReduce plan and what features are being executed in each phase. On Tue, Mar 11, 2014 at 11:29 AM, ey-chih Chow eyc...@gmail.com wrote: Hi, I got a question on a pig script that has a single input with

Re: one MR job for group-bys and cube-bys

2014-03-11 Thread Pradeep Gollakota
I forgot to mention that there are also other 3rd party libraries that make examining the physical plan easier. For example, take a look at Lipstick (https://github.com/Netflix/Lipstick) from Netflix. On Tue, Mar 11, 2014 at 11:41 AM, Pradeep Gollakota pradeep...@gmail.comwrote: Best way to examine

Re: HBase Rowkey Scan Taking more than 10 minutes.

2014-03-08 Thread Pradeep Gollakota
I believe the Prefix filter does a full table scan. What you want to do for fast seeks is provide a 'startKey' and 'endKey'. You can mimic what the prefix filter does by doing startKey = prefix and endKey = prefix~ (~ is the last printable ascii char) On Sat, Mar 8, 2014 at 8:37 AM, Parkirat

Re: Need suggestion, jpa kind of package for hbase

2014-03-06 Thread Pradeep Gollakota
Kundera also has support for HBase as far as I'm aware. On Thu, Mar 6, 2014 at 8:13 PM, jeevi tesh jeevitesh...@gmail.com wrote: Hi all, I'm new to hbase in search of jpa kind of package for hbase to push the data into hbase system. Started trying with stargate where i found very strict

Re: Nested foreach with order by

2014-02-27 Thread Pradeep Gollakota
Where exactly are you getting duplicates? I'm not sure I understand your question. Can you give an example please? On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis andronat_...@hotmail.com wrote: Hello everyone, I have a foreach statement and inside of it, I use an order by. After the

Re: Nested foreach with order by

2014-02-27 Thread Pradeep Gollakota
,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE) (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2) (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE) . . . On 27 Feb 2014, at 10:20 p.m., Pradeep Gollakota pradeep...@gmail.com wrote: Where exactly are you getting duplicates? I'm not sure I

Re: New Consumer API discussion

2014-02-13 Thread Pradeep Gollakota
Hi Neha, 6. It seems like #4 can be avoided by using Map&lt;TopicPartition, Long&gt; or Map&lt;TopicPartition, TopicPartitionOffset&gt; as the argument type. How? lastCommittedOffsets() is independent of positions(). I'm not sure I understood your suggestion. I think of subscription as you're subscribing

Re: New Consumer API discussion

2014-02-11 Thread Pradeep Gollakota
feedback on which APIs should have different arguments/return types? 2. lastCommittedOffsets() does what you said in the javadoc. Thanks, Neha On Tue, Feb 11, 2014 at 11:45 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi Jay, I apologize for derailing the conversation about

Re: New Consumer API discussion

2014-02-11 Thread Pradeep Gollakota
do you think? -Jay On Mon, Feb 10, 2014 at 3:37 PM, Pradeep Gollakota pradeep...@gmail.com wrote: WRT to hierarchical topics, I'm referring to KAFKA-1175 (https://issues.apache.org/jira/browse/KAFKA-1175). I would just like to think through the implications for the Consumer API

Re: Config for new clients (and server)

2014-02-10 Thread Pradeep Gollakota
+1 Jun. On Mon, Feb 10, 2014 at 2:17 PM, Sriram Subramanian srsubraman...@linkedin.com wrote: +1 on Jun's suggestion. On 2/10/14 2:01 PM, Jun Rao jun...@gmail.com wrote: I actually prefer to see those at INFO level. The reason is that the config system in an application can be complex.

Re: New Consumer API discussion

2014-02-10 Thread Pradeep Gollakota
uniquely identifies a partition of a topic Thanks, Neha On Mon, Feb 10, 2014 at 12:36 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Couple of very quick thoughts. 1. +1 about renaming commit(...) and commitAsync(...) 2. I'd also like to extend the above for the poll() method as well

Re: Building a producer/consumer supporting exactly-once messaging

2014-02-10 Thread Pradeep Gollakota
Have you read this part of the documentation? http://kafka.apache.org/documentation.html#semantics Just wondering if that solves your use case. On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington g.turking...@improvedigital.com wrote: Hi, I've been doing some prototyping on Kafka for a few

Re: New Consumer API discussion

2014-02-10 Thread Pradeep Gollakota
Couple of very quick thoughts. 1. +1 about renaming commit(...) and commitAsync(...) 2. I'd also like to extend the above for the poll() method as well. poll() and pollWithTimeout(long, TimeUnit)? 3. Have you guys given any thought around how this API would be used with hierarchical topics? 4.

Re: change the Yarn application container memory size when it is running

2014-02-10 Thread Pradeep Gollakota
I'm not sure I understand the use case for something like that. I'm pretty sure the YARN API doesn't support it though. What you might be able to do is to tear down your existing container and request a new one. On Mon, Feb 10, 2014 at 10:28 AM, Thomas Bentsen t...@bentzn.com wrote: I am no

[jira] [Commented] (KAFKA-1226) Rack-Aware replica assignment option

2014-02-07 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13895123#comment-13895123 ] Pradeep Gollakota commented on KAFKA-1226: -- [~jvanremoortere] Can you either add

[jira] [Commented] (KAFKA-1226) Rack-Aware replica assignment option

2014-02-07 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13895138#comment-13895138 ] Pradeep Gollakota commented on KAFKA-1226: -- [~jvanremoortere] Sweet! Thanks. Can

Coprocessor Client Blocking

2014-01-21 Thread Pradeep Gollakota
Hi All, In the blog describing the coprocessor there was a sequence diagram walking through the lifecycle of a Get. https://blogs.apache.org/hbase/mediaresource/60b135e5-04c6-4197-b262-e7cd08de784b I'm wondering if the lifecycle of a Put follows the same sequence. Specifically for my use case, I'm

Re: Coprocessor Client Blocking

2014-01-21 Thread Pradeep Gollakota
Thank you! On Tue, Jan 21, 2014 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote: bq. Does the client wait until the postPut() is executed? Yes. Please see HRegion#doMiniBatchMutation() In 0.94, it is around line 2521. Cheers On Tue, Jan 21, 2014 at 4:32 PM, Pradeep Gollakota pradeep

Re: how to control nested CROSS parallelism?

2014-01-20 Thread Pradeep Gollakota
It's strange that it's being executed on the Map-side. The group is a reduce side operation (I'm assuming) and it seems that the nested foreach would happen on Reduce-side after grouping. Have you looked at the MR plan to verify that it is being executed Map-side? One thing to try might be to

0.96 Replication to Elasticsearch

2014-01-15 Thread Pradeep Gollakota
Hi All, I have a use case where I need to replicate data from HBase into Elasticsearch. I've found two implementations of an HBase River that accomplish this. One uses timestamps to do a timerange scan of the table (since last sync) and replicates data across. For many reasons this is not

Re: Spilling issue - Optimize GROUP BY

2014-01-10 Thread Pradeep Gollakota
Did you mean to say timeout instead of spill? Spills don't cause task failures (unless a spill fails). Default timeout for a task is 10 min. It would be very helpful to have a stack trace to look at, at the very least. On Fri, Jan 10, 2014 at 7:53 AM, Zebeljan, Nebojsa

Re: secondary index feature

2013-12-22 Thread Pradeep Gollakota
I lied in my previous email... it doesn't look like Phoenix uses HIndex. On Sun, Dec 22, 2013 at 3:53 PM, Pradeep Gollakota pradeep...@gmail.comwrote: Take a look at this library from Huawei. They went a step further to colocate the index with the primary partition. I believe Phoenix uses

Re: Performance tuning

2013-12-21 Thread Pradeep Gollakota
Do you know if machines 19-23 are on a different rack? It seems to me that your problem might be a networking problem. Whether it is hardware, configuration or something else entirely, I'm not sure. It might be worthwhile to talk to your systems administrator to see why pings to these machines are

Re: Performance tuning

2013-12-21 Thread Pradeep Gollakota
, Pradeep Gollakota pradeep...@gmail.com wrote: Do you know if machines 19-23 are on a different rack? It seems to me that your problem might be a networking problem. Whether it is hardware, configuration or something else entirely, I'm not sure. It might be worthwhile to talk to your systems

Re: Guava 15

2013-12-16 Thread Pradeep Gollakota
This is kinda tangential, but for very very common dependencies such as guava, jackson, etc. would it make sense to use a shaded jar so as not to affect user dependencies? On Mon, Dec 16, 2013 at 7:47 PM, Ted Yu yuzhih...@gmail.com wrote: Please try out patch v2 from HBASE-10174 Thanks On

Re: Log File Versioning and Pig

2013-12-12 Thread Pradeep Gollakota
It seems like what you're asking for is Versioned Schema management. Pig is not designed for that. Pig is only a scripting language to manipulate datasets. I'd recommend you look into Thrift, Protocol Buffers and Avro. They are compact serialization libraries that do versioned schema management.

Re: zookeeper.znode.parent mismatch exception

2013-12-12 Thread Pradeep Gollakota
Did you recently upgrade to 0.96? This is a problem I faced with mismatched clients connecting to a 0.96 cluster. Starting in that version, the root node for ZooKeeper changed from /hbase to /hbase-unsecure (if in unsecure mode). On Thu, Dec 12, 2013 at 10:47 PM, Sandeep L

Client API best practices for my use case

2013-12-10 Thread Pradeep Gollakota
Hi All, I'm trying to understand how different configuration will affect performance for my use cases. My table has the following the following schema. I'm storing event logs in a single column family. The row key is in the format [company][timestamp][uuid]. My access pattern is fairly simple.

[jira] [Updated] (KAFKA-1175) Hierarchical Topics

2013-12-09 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Gollakota updated KAFKA-1175: - Issue Type: New Feature (was: Bug) Hierarchical Topics

[jira] [Commented] (KAFKA-1175) Hierarchical Topics

2013-12-09 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843861#comment-13843861 ] Pradeep Gollakota commented on KAFKA-1175: -- I'm very interested in this feature

[jira] [Commented] (KAFKA-1175) Hierarchical Topics

2013-12-09 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844002#comment-13844002 ] Pradeep Gollakota commented on KAFKA-1175: -- In the proposal, [~jkreps] talks

Re: Trouble with REGEX in PIG

2013-12-04 Thread Pradeep Gollakota
It's not valid PigLatin... The Grunt shell doesn't let you try out functions and UDFs the way you're trying to use them. A = LOAD 'data' USING PigStorage() as (ip: chararray); B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1); DUMP B; You always have to load a dataset and work

Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Pradeep Gollakota
I tried to following script (not exactly the same) and it worked correctly for me. businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c, business_id: chararray, lat: double, lng: double); locations = FOREACH businesses GENERATE business_id, lat, lng; STORE locations INTO 'locations.tsv';

Re: Apache Pig + Storm integration

2013-12-02 Thread Pradeep Gollakota
Jacob Perkins submitted a POC patch. However, my guess is that this will not be included in the 0.13 release. There's still quite a bit of work to be done and we'll be working on it. You can track the progress at https://issues.apache.org/jira/browse/PIG-3453 On Mon, Dec 2, 2013 at 9:51 AM,

Re: Online/Realtime query with filter and join?

2013-12-02 Thread Pradeep Gollakota
In addition to Impala and Phoenix, I'm going to throw PrestoDB into the mix. :) http://prestodb.io/ On Mon, Dec 2, 2013 at 10:58 AM, Doug Meil doug.m...@explorysmedical.comwrote: You are going to want to figure out a rowkey (or a set of tables with rowkeys) to restrict the number of I/O's.

Re: Online/Realtime query with filter and join?

2013-12-02 Thread Pradeep Gollakota
defines your row key. You should lead with the columns that you'll filter against most frequently. Then, take a look at adding secondary indexes to speedup queries against other columns. Thanks, James On Mon, Dec 2, 2013 at 11:01 AM, Pradeep Gollakota pradeep...@gmail.com wrote

Re: Need help

2013-11-27 Thread Pradeep Gollakota
This question belongs on the user list. The dev list is meant for Pig developers to discuss issues related to the development of Pig. I’ve forwarded this to the user list. It also helps tremendously if you format your data and scripts nicely as they’re much easier to read and understand. I use a

Re: add a key value pair in map

2013-11-15 Thread Pradeep Gollakota
I don't think there's an out of the box solution for it. But it's fairly trivial to do with a UDF On Nov 15, 2013 3:19 PM, Jerry Lam chiling...@gmail.com wrote: Hi Pig users, Do you know how to add a key value pair into a map? For instance, a relation of A contains a document:map[] for each

Re: hbase suitable for churn analysis ?

2013-11-14 Thread Pradeep Gollakota
I'm a little curious as to how you would be able to use no_of_days as a column qualifier at all... it changes everyday for all users right? So how will you keep your table updated? On Thu, Nov 14, 2013 at 9:07 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: You can use your no_day as a

Re: replicated join gets extra job

2013-11-11 Thread Pradeep Gollakota
Use the ILLUSTRATE or EXPLAIN keywords to look at the details of the physical execution plan... at first glance it doesn't look like you'd need a 2nd job to do the joins, but if you can post the output of ILLUSTRATE/EXPLAIN, we can look into it. On Mon, Nov 11, 2013 at 4:36 PM, Dexin Wang
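For reference, a replicated join and the EXPLAIN call look like the sketch below (aliases and file names are invented):

  big   = LOAD 'big_data'   AS (id: chararray, val: int);
  small = LOAD 'small_data' AS (id: chararray, name: chararray);
  J = JOIN big BY id, small BY id USING 'replicated';
  EXPLAIN J;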

Re: Bag of tuples

2013-11-06 Thread Pradeep Gollakota
Each element in A is not a Bag. A relation is a collection of tuples (just like a bag). So each element in A is a tuple whose first element is a Bag. If you want to order the tuples by id, you have to extract them from the bag first. A = LOAD 'data' ...; B = FOREACH A GENERATE FLATTEN($0); C =
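The snippet is cut off above; a plausible completion of the same pattern, assuming the bag's tuples carry an id field, is:

  A = LOAD 'data' AS (b: bag{t: (id: int, val: chararray)});
  B = FOREACH A GENERATE FLATTEN(b) AS (id, val);   -- pull the tuples out of the bag
  C = ORDER B BY id;                                -- now they can be sorted by id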

Re: Local vs mapreduce mode

2013-11-05 Thread Pradeep Gollakota
Really dumb question but... when running in MapReduce mode, is your input file on HDFS? On Tue, Nov 5, 2013 at 9:17 AM, Sameer Tilak ssti...@live.com wrote: Dear Pig experts, I have the following Pig script that works perfectly in local mode. However, in the mapreduce mode I get AU as :

Re: Pig Distributed Cache

2013-11-05 Thread Pradeep Gollakota
CROSS is grossly expensive to compute, so I’m not surprised that the performance isn’t good enough. Are you repeating your LOAD and FILTER ops for every one of your small files? At the end of the day, what is it that you’re trying to accomplish? Find the 1 row you’re after and attach it to all rows in
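If the goal really is to attach that single row to every row of the big relation, a scalar projection (or a replicated join) is normally far cheaper than CROSS; a hedged sketch with invented aliases and field names:

  params = LOAD 'small_file' AS (name: chararray, value: double);
  f      = FILTER params BY name == 'threshold';
  one    = LIMIT f 1;                              -- must end up with exactly one row
  big    = LOAD 'big_file' AS (id: chararray, score: double);
  out    = FOREACH big GENERATE id, score, one.value AS threshold;

Pig treats one.value as a scalar and makes it available to every task, so the big relation never has to be shuffled against it.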

Re: Pig Distributed Cache

2013-11-05 Thread Pradeep Gollakota
but originally it stores in different environment. I pull the data from there and load into HDFS. Anyway because of our architecture I can't change it right now. Thanks Best regards... On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota pradeep...@gmail.com wrote: CROSS is grossly expensive

[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-04 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813136#comment-13813136 ] Pradeep Gollakota commented on PIG-3453: [~thedatachef] Wow... This is a great start

Re: Java UDF and incompatible schema

2013-11-04 Thread Pradeep Gollakota
This is most likely because you haven't defined the outputSchema method of the UDF. The AS keyword merges the schema generated by the UDF with the user specified schema. If the UDF does not override the method and specify the output schema, it is considered null and you will not be able to use AS

Re: limit map tasks for load function

2013-11-03 Thread Pradeep Gollakota
I think you’re misunderstanding how HBaseStorage works. HBaseStorage uses the HBaseInputFormat under the hood. The number of map tasks spawned depends on the number of regions you have, and the tasks are scheduled so that they are local to the regions they’re reading from.
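For context, a typical HBaseStorage load looks like the sketch below (table, column family and field names are invented); nothing in the statement itself controls the number of map tasks, which follows the table's regions:

  raw = LOAD 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:a cf:b', '-loadKey true')
        AS (rowkey: chararray, a: chararray, b: chararray);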

Re: simple pig logic

2013-10-31 Thread Pradeep Gollakota
If I understood your question correctly, given the following input: main_data.txt {id: foo, some_field: 12354, score: 0} {id: foobar, some_field: 12354, score: 0} {id: baz, some_field: 12345, score: 0} score_data.txt {id: foo, score: 1} {id: foobar, score: 20} you want the following output
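One way to produce that kind of merge is a left outer join that prefers the score file's value when it exists; the LOAD details below are invented since the real input format isn't shown:

  main   = LOAD 'main_data.txt'  AS (id: chararray, some_field: int, score: int);
  scores = LOAD 'score_data.txt' AS (id: chararray, score: int);
  joined = JOIN main BY id LEFT OUTER, scores BY id;
  result = FOREACH joined GENERATE main::id, main::some_field,
           (scores::score IS NULL ? main::score : scores::score) AS score;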

Re: UDFContext NULL JobConf

2013-10-30 Thread Pradeep Gollakota
Are you able to post your UDF (or at least a sanitized version)? On Wed, Oct 30, 2013 at 10:46 AM, Henning Kropp henning.kr...@gmail.comwrote: Hi, thanks for your reply. I read about the expected behavior on the front-end and I am getting the NPE on the back-end. The Mappers log the

Re: count distinct on multiple columns

2013-10-29 Thread Pradeep Gollakota
Great question. There seems to be some confusion about how DISTINCT operates. I remembered (and thankfully found) this message (http://mail-archives.apache.org/mod_mbox/pig-user/201309.mbox/%3CCAE7pYjar3hX4Kp%2B5SQz3sr%3DvjxfQDVq_6Yi4vh9KgfOj3dzTGw%40mail.gmail.com%3E) that explains the behavior. As
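For reference, the usual way to count distinct combinations of several columns is to project just those columns before the DISTINCT; the names below are made up:

  A      = LOAD 'data' AS (user: chararray, item: chararray, ts: long);
  pairs  = FOREACH A GENERATE user, item;
  uniq   = DISTINCT pairs;
  g      = GROUP uniq ALL;
  counts = FOREACH g GENERATE COUNT(uniq) AS distinct_pairs;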

Re: Parent Child Relationships in Pig

2013-10-24 Thread Pradeep Gollakota
Not really... In my experience, Pig is only good at dealing with tabular data. The type of graphical data you have is not workable in Pig. Have you considered using a Graph database (such as Neo4j)? These databases are highly optimized for doing the type of path queries you're looking for. On

Re: Attach bag for each tuple and pass to UDF

2013-10-23 Thread Pradeep Gollakota
A replicated cross (implemented as a replicated join on a synthetic key) is probably your best bet. On Wed, Oct 23, 2013 at 2:09 PM, Daniel Dai da...@hortonworks.com wrote: Can you do a cross? On Mon, Oct 21, 2013 at 2:21 PM, Serega Sheypak serega.shey...@gmail.com wrote: Hi, I have two
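The trick being described looks roughly like this (aliases invented); the constant key turns the cross into a fragment-replicate join, so the replicated side has to fit in memory:

  A1  = FOREACH A GENERATE *, 1 AS k;
  B1  = FOREACH B GENERATE *, 1 AS k;
  AxB = JOIN A1 BY k, B1 BY k USING 'replicated';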

Re: Need to update dfs.data.dir in Data Nodes. Best approach ?

2013-10-22 Thread Pradeep Gollakota
I think you want to use option 2. It preserves the data that's on those data nodes. P.S.: The Hadoop mailing list might be a better list to post this (type) of question on. This is not really HBase specific. On Tue, Oct 22, 2013 at 7:53 AM, satish satishkorit...@gmail.com wrote: Hi All, We

Re: Impala Query problem

2013-10-22 Thread Pradeep Gollakota
Please ask this on the Impala mailing list. This is not an HBase (or Hive) question. On Tue, Oct 22, 2013 at 1:13 AM, Garg, Rinku rinku.g...@fisglobal.comwrote: Hi All, We have installed cludera hadoop-2.0.0-mr1-cdh4.2.0 with hive-0.10.0-cdh4.2.0. Both are working as desired. We can run any

Re: doseq vs dorun

2013-10-18 Thread Pradeep Gollakota
Hi All, Thank you so much for your replies! For my particular use case (tail -f multiple files and write the entries into a db), I'm using pmap to process each file in a separate thread and for each file, I'm using doseq to write to db. It seems to be working well (though I still need to

Re: Elephant-Bird: Building error

2013-10-17 Thread Pradeep Gollakota
This question does not belong in the Pig mailing list. Please ask on the elephant bird mailing list at https://groups.google.com/forum/?fromgroups#!forum/elephantbird-dev On Thu, Oct 17, 2013 at 4:02 PM, Zhu Wayne zhuw.chic...@gmail.com wrote: Why build? Get from maven repo.

doseq vs dorun

2013-10-16 Thread Pradeep Gollakota
Hi All, I’m (very) new to clojure (and loving it)… and I’m trying to wrap my head around how to correctly choose doseq vs dorun for my particular use case. I’ve read this earlier post https://groups.google.com/forum/#!msg/clojure/8ebJsllH8UY/mXtixH3CRRsJ and I had a clarifying question.

Re: even possible?

2013-10-16 Thread Pradeep Gollakota
Don't fix it if it ain't broken =P There shouldn't be any reason why you couldn't change it (back) to the standard way that cloudera distributions are set up. Off the top of my head, I can't think of anything that you're missing. But at the same time, if your cluster is working as is, why change

Re: number of M/R jobs for a Pig Script

2013-10-15 Thread Pradeep Gollakota
I'm not aware of any way to do that. I think you're also missing the spirit of Pig. Pig is meant to be a data workflow language. Describe a workflow for your data using PigLatin and Pig will then compile your script to MapReduce jobs. The number of MapReduce jobs that it generates is the smallest

Re: number of M/R jobs for a Pig Script

2013-10-15 Thread Pradeep Gollakota
, 2013 at 10:16 AM, Pradeep Gollakota pradeep...@gmail.com wrote: I'm not aware of anyway to do that. I think you're also missing the spirit of Pig. Pig is meant to be a data workflow language. Describe a workflow for your data using PigLatin and Pig will then compile your script to MapReduce

Re: Yarn killing my Application Master

2013-10-14 Thread Pradeep Gollakota
not successfully registered with RM On Fri, Oct 11, 2013 at 3:53 PM, Pradeep Gollakota pradeep...@gmail.comwrote: All, I have a Yarn application that is launching a single container. The container completes successfully but the application fails because the node manager is killing my

Re: State of Art in Hadoop Log aggregation

2013-10-11 Thread Pradeep Gollakota
There are plenty of log aggregation tools, both open source and commercial off the shelf. Here are some: http://devopsangle.com/2012/04/19/8-splunk-alternatives/ My personal recommendation is LogStash. On Thu, Oct 10, 2013 at 10:38 PM, Raymond Tay raymondtay1...@gmail.comwrote: You can try Chukwa

Re: Improving MR job disk IO

2013-10-10 Thread Pradeep Gollakota
Actually... I believe that is expected behavior. Since your CPU is pegged at 100% you're not going to be IO bound. Typically jobs tend to be CPU bound or IO bound. If you're CPU bound you expect to see low IO throughput. If you're IO bound, you expect to see low CPU usage. On Thu, Oct 10, 2013

Re: Improving MR job disk IO

2013-10-10 Thread Pradeep Gollakota
better disk throughput and, (2) CPU load is almost evenly spread across all cores/threads (no CPU gets pegged to 100%). On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota pradeep...@gmail.comwrote: Actually... I believe that is expected behavior. Since your CPU is pegged at 100% you're

Re: modify HDFS

2013-10-02 Thread Pradeep Gollakota
Since hadoop 3.0 is 2 major versions higher, it will be significantly different than working with hadoop 1.1.2. The hadoop-1.1 branch is available on SVN at http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/ On Tue, Oct 1, 2013 at 11:30 PM, Karim Awara

Re: [Discussion] Any thoughts on PIG-3457?

2013-09-30 Thread Pradeep Gollakota
I myself am in favor of the two branch approach. It won't block the 0.12 release and it is easier to maintain. On Mon, Sep 30, 2013 at 12:56 PM, Jeremy Karn jk...@mortardata.com wrote: Ok, sounds good. I'll take a shot at it tonight. On Mon, Sep 30, 2013 at 3:48 PM, Daniel Dai

Re: IncompatibleClassChangeError

2013-09-29 Thread Pradeep Gollakota
I believe it's a difference between the version that your code was compiled against vs the version that you're running against. Make sure that you're not packaging Hadoop jars into your jar, and make sure you're compiling against the correct version as well. On Sun, Sep 29, 2013 at 7:27 PM, lei

Re: IncompatibleClassChangeError

2013-09-29 Thread Pradeep Gollakota
, lei liu liulei...@gmail.com wrote: Yes, My job is compiled in CHD3u3, and I run the job on CDH4.3.1, but I use the mr1 of CHD4.3.1 to run the job. What are the different mr1 of cdh4 and mr of cdh3? Thanks, LiuLei 2013/9/30 Pradeep Gollakota pradeep...@gmail.com I believe it's

Re: Reading simple json file

2013-09-23 Thread Pradeep Gollakota
Improper capitalization. Storage functions are case sensitive, try JsonLoader. On Mon, Sep 23, 2013 at 2:37 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, I am trying to read simple json data as: d =LOAD 'json_output' USING JSONLOADER(('ip:chararray,_id:chararray,cats:[chararray]'); But
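With the capitalization (and the stray parenthesis) fixed, the load from the question would read roughly as below; the schema string is carried over from the snippet, so treat it as illustrative:

  d = LOAD 'json_output' USING JsonLoader('ip:chararray, _id:chararray, cats:[chararray]');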

Re: Help writing a YARN application

2013-09-23 Thread Pradeep Gollakota
, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a Yarn application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-22 Thread Pradeep Gollakota
, Pradeep Gollakota pradeep...@gmail.com wrote: One thing that comes to mind is that your keys are Strings which are highly inefficient. You might get a lot better performance if you write a custom writable for your Key object using the appropriate data types. For example, use a long

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pradeep Gollakota
I'm sorry but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId since it occurs first in your key. The SortComparator will tell Hadoop how to sort your data. So you use this if

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pradeep Gollakota
to do the job but i'm supposed to do this via a MR job.. So, cannot use either of that.. Do you recommend me to try something if i have the data in that format? On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota pradeep...@gmail.comwrote: I'm sorry but I don't understand your question

Re: ISOToUNix working in Pig 0.8.1 but not in Pig 0.11.0

2013-09-20 Thread Pradeep Gollakota
Be careful with your format definition... it looks like you might have a typo. I believe yyyy-MM-dd hh:mm:ss is the correct format. http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html On Fri, Sep 20, 2013 at 8:26 AM, Ruslan Al-Fakikh metarus...@gmail.comwrote:

Re: ISOToUNix working in Pig 0.8.1 but not in Pig 0.11.0

2013-09-20 Thread Pradeep Gollakota
Doh! I think I made a mistake myself... yyyy-MM-dd HH:mm:ss Since you don't have AM/PM, I'm assuming that your time is in 24-hr format. So you need to use the 24-hour format symbol 'H' for hour instead of 'h'. I really hate time. On Fri, Sep 20, 2013 at 6:25 PM, Pradeep Gollakota pradeep
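Put together, the piggybank conversion would look something like the sketch below (jar path, alias and field names are placeholders): CustomFormatToISO turns the custom pattern into an ISO 8601 string, and ISOToUnix then yields epoch milliseconds.

  REGISTER /path/to/piggybank.jar;
  DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
  DEFINE ISOToUnix         org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
  logs = LOAD 'logs' AS (ts: chararray);
  out  = FOREACH logs GENERATE ISOToUnix(CustomFormatToISO(ts, 'yyyy-MM-dd HH:mm:ss')) AS epoch_millis;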

Help writing a YARN application

2013-09-20 Thread Pradeep Gollakota
Hi All, I've been trying to write a Yarn application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm

Re: how to load custom Writable class from sequence file?

2013-09-16 Thread Pradeep Gollakota
The problem is that pig only speaks its data types. So you need to tell it how to translate from your custom writable to a pig datatype. Apparently elephant-bird has some support for doing this type of thing... take a look at this SO post

Re: how to load custom Writable class from sequence file?

2013-09-16 Thread Pradeep Gollakota
fails On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota pradeep...@gmail.com wrote: The problem is that pig only speaks its data types. So you need to tell it how to translate from your custom writable to a pig datatype. Apparently elephant-bird has some support for doing this type

Re: how to load custom Writable class from sequence file?

2013-09-16 Thread Pradeep Gollakota
to write the converters from your types to Pig data types and pass it into the constructor of the SequenceFileLoader. Hope this helps! On Mon, Sep 16, 2013 at 6:56 PM, Pradeep Gollakota pradeep...@gmail.comwrote: Thats correct... The load ... AS (k:chararray, v:charrary); doesn't actually do what
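The elephant-bird flavour being described looks roughly like the sketch below; the jar path is a placeholder, the stock TextConverter is used for both sides, and a custom Writable would need its own WritableConverter passed via the same '-c' option:

  REGISTER /path/to/elephant-bird-pig.jar;
  A = LOAD 'data.seq'
      USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
          '-c com.twitter.elephantbird.pig.util.TextConverter',
          '-c com.twitter.elephantbird.pig.util.TextConverter')
      AS (key: chararray, value: chararray);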

Re: Sort Order in HBase with Pig/Piglatin in Java

2013-09-13 Thread Pradeep Gollakota
insead. Do you knows whats the better choice? TreeMap or LinkedHashMap? Anyway thanks :) 2013/9/13 Pradeep Gollakota pradeep...@gmail.com Thats a great observation John! The problem is that HBaseStorage maps columns families into a HashMap, so the sort ordering is completely lost. You
