[DISCUSS] JSON Canonical Extension Type

2022-11-17 Thread Pradeep Gollakota
Hi folks! I put together this specification for canonicalizing the JSON type in Arrow. ## Introduction JSON is a widely used text based data interchange format. There are many use cases where a user has a column whose contents are a JSON encoded string. BigQuery's [JSON Type][1] and Parquet’s

[jira] [Created] (ARROW-17255) Support JSON logical type in Arrow

2022-07-29 Thread Pradeep Gollakota (Jira)
Pradeep Gollakota created ARROW-17255: - Summary: Support JSON logical type in Arrow Key: ARROW-17255 URL: https://issues.apache.org/jira/browse/ARROW-17255 Project: Apache Arrow Issue

Re: Request for PR Review

2018-01-30 Thread Pradeep Gollakota
Gentle thread bump. On Thu, Jan 18, 2018 at 4:03 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote: > Hi All, > > Can one of you review my PR at https://github.com/apache/parquet-mr/pull/447 please? > > Thanks, > Pradeep >

Request for PR Review

2018-01-18 Thread Pradeep Gollakota
Hi All, Can one of you review my PR at https://github.com/apache/parquet-mr/pull/447 please? Thanks, Pradeep

Re: Kafka with Zookeeper behind AWS ELB

2017-07-20 Thread Pradeep Gollakota
Luigi, I strongly urge you to consider a 5 node ZK deployment. I've always done that in the past for resiliency during maintenance. In a 3 node cluster, you can only tolerate one "failure", so if you bring one node down for maintenance and another node crashes during said maintenance, your ZK

Re: Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All, It appears that the bottleneck in my job was the EBS volumes. Very high i/o wait times across the cluster. I was only using 1 volume. Increasing to 4 made it faster. Thanks, Pradeep On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote: > Hi All,

Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All, I have a simple ETL job that reads some data, shuffles it and writes it back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0. After Stage 0 completes and the job starts Stage 1, I see a huge slowdown in the job. The CPU usage is low on the cluster, as is the network I/O. From

Re: Scaling up kafka consumers

2017-02-24 Thread Pradeep Gollakota
A single partition can be consumed by at most a single consumer. Consumers compete to take ownership of a partition. So, in order to gain parallelism you need to add more partitions. There is a library that allows multiple consumers to consume from a single partition
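
A minimal sketch of the consumer side of this, assuming a recent Java client; the broker address, topic, and group id are placeholders. Every instance that subscribes with the same group.id competes for partitions, so the partition count caps useful parallelism.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // all members share one group id
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        // Each running instance of this process joins the same group; Kafka
        // assigns each partition to at most one member, so running more
        // instances than the topic has partitions leaves the extras idle.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d%n",
                        record.partition(), record.offset());
                }
            }
        }
    }
}
```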

Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <l...@cloudera.com> wrote: > Can you check the value of ParquetMetaData.created_by? Once you have that, > you should see if it gets filtered by the code in CorruptStatistics.java. > > On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollak

Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Pradeep Gollakota
ed by the > consumer need to be handled by some other group members." > > So does this mean that the consumer should inform the group ahead of > time before it goes down? Currently, I just shutdown the process. > > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota <pr

Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
statistics are not written to the footer? If you > used parquet-mr, they may be there but be ignored. > > Cheers, Lars > > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pradeep...@gmail.com> > wrote: > > > Bumping the thread to see if I get any responses.

Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Pradeep Gollakota
I asked a similar question a while ago. There doesn't appear to be a way to avoid triggering the rebalance. But I'm not sure why it would be taking > 1hr in your case. For us it was pretty fast. https://www.mail-archive.com/users@kafka.apache.org/msg23925.html On Fri, Feb 10, 2017 at 4:28 AM,

Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
Bumping the thread to see if I get any responses. On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote: > Hi folks, > > I generated a bunch of parquet files using spark and > ParquetThriftOutputFormat. The thirft model has a column called "device

Missing min/max statistics in file footer

2017-02-08 Thread Pradeep Gollakota
Hi folks, I generated a bunch of parquet files using spark and ParquetThriftOutputFormat. The thrift model has a column called "deviceId" which is a string column. It also has a "timestamp" column of int64. After the files have been generated, I inspected the file footers and noticed that only

Re: Unable to compile thrift

2017-02-08 Thread Pradeep Gollakota
Volker <l...@cloudera.com> wrote: > I remember trying to compile with the latest version of thrift shipped in > Ubuntu 14.04 a few weeks back and got the same error. Using 0.7 worked > though. Sadly I don't know why it fails on a Mac. > > On Feb 8, 2017 21:18, "P

Re: Unable to compile thrift

2017-02-08 Thread Pradeep Gollakota
0 -- let us know if you have issues with > these > > Thanks > Wes > > On Wed, Feb 8, 2017 at 2:19 PM, Pradeep Gollakota <pradeep...@gmail.com> > wrote: > > Hi folks, > > > > I'm trying to build parquet from source. However, the instructions

[jira] [Created] (PARQUET-869) Min/Max record counts for block size checks are not configurable

2017-02-07 Thread Pradeep Gollakota (JIRA)
Pradeep Gollakota created PARQUET-869: - Summary: Min/Max record counts for block size checks are not configurable Key: PARQUET-869 URL: https://issues.apache.org/jira/browse/PARQUET-869 Project

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
Usually this kind of thing can be done at a lower level in the InputFormat usually by specifying the max split size. Have you looked into that possibility with your InputFormat? On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote: > Hi Jasbir, > > Yes, you are right. Do you have
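
For the common FileInputFormat case, the cap can be set directly on the job; a minimal sketch against the new mapreduce API (job name and size are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "smaller-splits");
        job.setInputFormatClass(TextInputFormat.class);
        // Cap each split at 64 MB; the FileInputFormat then breaks larger
        // blocks into multiple splits, raising parallelism downstream.
        // Equivalent to setting mapreduce.input.fileinputformat.split.maxsize.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```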

Re: Consumer Rebalancing Question

2017-01-06 Thread Pradeep Gollakota
and reassigns it to another member of > the group. This happens once and then the "issue" is resolved without any > additional interruptions. > > -Ewen > > On Thu, Jan 5, 2017 at 3:01 PM, Pradeep Gollakota <pradeep...@gmail.com> > wrote:

Consumer Rebalancing Question

2017-01-04 Thread Pradeep Gollakota
Hi Kafka folks! When a consumer is closed, it will issue a LeaveGroupRequest. Does anyone know how long the coordinator waits before reassigning the partitions that were assigned to the leaving consumer to a new consumer? I ask because I'm trying to understand the behavior of consumers if you're

Re: Spark Website

2016-07-13 Thread Pradeep Gollakota
Worked for me if I go to https://spark.apache.org/site/ but not https://spark.apache.org On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart wrote: > Same here > > > > *From: *Benjamin Kim > *Date: *Wednesday, July 13, 2016 at 11:47 AM > *To: *manish

Re: kafka + autoscaling groups fuckery

2016-06-28 Thread Pradeep Gollakota
Just out of curiosity, if you guys are in AWS for everything, why not use Kinesis? On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors wrote: > Hi there, > > I just finished implementing kafka + autoscaling groups in a way that made > sense to me. I have a _lot_ of experience

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma separated list. I haven't tried this, but I think you should just be able to do sc.textFile("file1,file2,...") On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote: > I know these workaround, but wouldn't it be more
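
A sketch of what that looks like from the Java API, with hypothetical file paths; textFile() passes the string through to Hadoop's TextInputFormat, which splits it on commas:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MultiPathTextFile {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("multi-path").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // A comma-separated list of paths is treated as multiple inputs
            // by the underlying Hadoop input format (globs work too).
            JavaRDD<String> lines = sc.textFile("data/file1.txt,data/file2.txt");
            System.out.println("total lines: " + lines.count());
        }
    }
}
```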

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/ On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang <zjf...@gmail.com> wrote: > Yes, that's what I suggest. TextInputFormat support multiple inputs. So in > spark side, we just need to provide API to for that. > > On Thu, Nov 12, 2015 a

Re: Datacenter to datacenter over the open internet

2015-10-06 Thread Pradeep Gollakota
At Lithium, we have multiple datacenters and we distcp our data across our Hadoop clusters. We have 2 DCs in NA and 1 in EU. We have a non-redundant direct connect from our EU cluster to one of our NA DCs. If and when this fails, we have automatic failover to a VPN that goes over the internet. The

Re: Dealing with large messages

2015-10-06 Thread Pradeep Gollakota
t 2015 02:02, "James Cheng" <jch...@tivo.com> wrote: >> Here’s an article that Gwen wrote earlier this year on handling large >> messages in Kafka. >> http://ingest.tips/2015/01/21/handling-large-message

Dealing with large messages

2015-10-05 Thread Pradeep Gollakota
Fellow Kafkaers, We have a pretty heavyweight legacy event logging system for batch processing. We're now sending the events into Kafka for realtime analytics. But we have some pretty large messages (> 40 MB). I'm wondering if any of you have use cases where you have to send large messages
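
For the keep-it-in-Kafka route, the relevant knobs look roughly like this (broker address, topic, and sizes are illustrative); the producer, broker, and consumer limits all have to be raised together:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LargeMessageProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Raise the producer-side cap above the largest expected payload.
        props.put("max.request.size", String.valueOf(48 * 1024 * 1024));
        // The broker (message.max.bytes) and consumers (max.partition.fetch.bytes)
        // must be raised to match, or large sends will still be rejected.

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[40 * 1024 * 1024]; // stand-in for a 40 MB event
            producer.send(new ProducerRecord<>("events", payload));
        }
    }
}
```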

Re: number of topics given many consumers and groups within the data

2015-09-30 Thread Pradeep Gollakota
To add a little more context to Shaun's question, we have around 400 customers. Each customer has a stream of events. Some customers generate a lot of data while others don't. We need to ensure that each customer's data is sorted globally by timestamp. We have two use cases around consumption:

CombineHiveInputFormat not working

2015-09-30 Thread Pradeep Gollakota
Hi all, I have an external table with the following DDL. ``` DROP TABLE IF EXISTS raw_events; CREATE EXTERNAL TABLE IF NOT EXISTS raw_events ( raw_event_string string) PARTITIONED BY (dc string, community string, dt string) STORED AS TEXTFILE LOCATION

Re: CombineHiveInputFormat not working

2015-09-30 Thread Pradeep Gollakota
CombineHiveInputFormat not working > > what are your values for: > > mapred.min.split.size > > mapred.max.split.size > > hive.hadoop.supports.splittable.combineinputformat > > > > *From:* Pradeep Gollakota [mailto:pradeep...@gmail.com] > *Sent:* Wed

Re: CombineHiveInputFormat not working

2015-09-30 Thread Pradeep Gollakota
Depending on your hadoop distro and version, be potentially aware of > > https://issues.apache.org/jira/browse/MAPREDUCE-1597 > > and > > https://issues.apache.org/jira/browse/MAPREDUCE-5537 > > > > test it and see... > > > > *From:* Pradeep Gollakota [mailto:pradeep.

Re: Very slow dynamic partition load

2015-06-11 Thread Pradeep Gollakota
actual partitions in the table but simply partitioned data in hdfs give it a shot. It may be worthwhile looking into optimizations for this use case. -Slava On Thu, Jun 11, 2015 at 11:56 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I have a table which is partitioned on two

Re: Very slow dynamic partition load

2015-06-11 Thread Pradeep Gollakota
I actually decided to remove one of my 2 partition columns and make it a bucketing column instead... same query completed fully in under 10 minutes with 92 partitions added. This will suffice for me for now. On Thu, Jun 11, 2015 at 2:25 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hmm

Very slow dynamic partition load

2015-06-11 Thread Pradeep Gollakota
Hi All, I have a table which is partitioned on two columns (customer, date). I'm loading some data into the table using a Hive query. The MapReduce job completed within a few minutes and needs to commit the data to the appropriate partitions. There were about 32000 partitions generated. The

Re: HCatInputFormat combine splits

2015-05-14 Thread Pradeep Gollakota
:37 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I'm writing an MR job to read data using HCatInputFormat... however, the job is generating too many splits. I don't have this problem when running queries in Hive since it combines splits by default. Is there an equivalent in MR

HCatInputFormat combine splits

2015-05-14 Thread Pradeep Gollakota
Hi All, I'm writing an MR job to read data using HCatInputFormat... however, the job is generating too many splits. I don't have this problem when running queries in Hive since it combines splits by default. Is there an equivalent in MR so that I'm not generating thousands of mappers? Thanks,

Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?

2015-04-12 Thread Pradeep Gollakota
Also, mapred job -kill job_id On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus shahab.yu...@gmail.com wrote: You can kill it by using the following yarn command yarn application -kill <application id> https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html Or use

Re: integrate Camus and Hive?

2015-03-09 Thread Pradeep Gollakota
If I understood your question correctly, you want to be able to read the output of Camus in Hive and be able to know partition values. If my understanding is right, you can do so by using the following. Hive provides the ability to provide custom patterns for partitions. You can use this in

Re: JIRA attack!

2015-02-08 Thread Pradeep Gollakota
Apparently I joined this list at the right time :P On Sat, Feb 7, 2015 at 4:40 PM, Jay Kreps jay.kr...@gmail.com wrote: I closed about 350 redundant or obsolete issues. If I closed an issue you think is not obsolete, my apologies, just reopen. -Jay

[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-06 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310241#comment-14310241 ] Pradeep Gollakota commented on KAFKA-1884: -- [~guozhang] That's what I figured

[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-06 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310548#comment-14310548 ] Pradeep Gollakota commented on KAFKA-1884: -- I guess that makes sense... I'll

[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-05 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308670#comment-14308670 ] Pradeep Gollakota commented on KAFKA-1884: -- What makes the behavior in #2 earlier

[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-05 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308539#comment-14308539 ] Pradeep Gollakota commented on KAFKA-1884: -- I'd like to work on this. Please

Re: [kafka-clients] Re: [VOTE] 0.8.2.0 Candidate 3

2015-02-03 Thread Pradeep Gollakota
Lithium Technologies would love to host you guys for a release party in SF if you guys want. :) On Tue, Feb 3, 2015 at 11:04 AM, Gwen Shapira gshap...@cloudera.com wrote: When's the party? :) On Mon, Feb 2, 2015 at 8:13 PM, Jay Kreps jay.kr...@gmail.com wrote: Yay! -Jay On Mon,

Re: Hive - regexp_replace function for multiple strings

2015-02-03 Thread Pradeep Gollakota
I don't think this is doable using the out of the box regexp_replace() UDF. The way I would do it is to use a file to create a mapping between a regexp and its replacement, and write a custom UDF that loads this file and applies all the regular expressions to the input. Hope this helps. On Tue, Feb
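
A minimal sketch of such a UDF against Hive's old-style UDF API; in practice the (pattern, replacement) pairs would be loaded from the mapping file, but they are hardcoded here (with made-up rules) to keep the example self-contained:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;

/** Applies an ordered list of regex replacements to the input string. */
public class MultiRegexpReplace extends UDF {
    // Ordered so earlier rules are applied before later ones.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("foo+", "bar");      // hypothetical rule
        RULES.put("[0-9]{4}", "####"); // hypothetical rule
    }

    public String evaluate(String input) {
        if (input == null) return null;
        String result = input;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }
}
```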

Re: New Producer - ONLY sync mode?

2015-02-02 Thread Pradeep Gollakota
This is a great question Otis. Like Gwen said, you can accomplish Sync mode by setting the batch size to 1. But this does highlight a shortcoming of the new producer API. I really like the design of the new API and it has really great properties and I'm enjoying working with it. However, once API
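
For reference, a hedged sketch of the synchronous pattern with the new producer (broker and topic names are placeholders): send() returns a Future, and blocking on it with get() gives sync semantics one message at a time.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SyncSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; blocking on the returned Future makes
            // the call synchronous and surfaces send errors as exceptions.
            RecordMetadata md =
                producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
            System.out.printf("acked at partition=%d offset=%d%n",
                md.partition(), md.offset());
        }
    }
}
```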

Re: New Producer - ONLY sync mode?

2015-02-02 Thread Pradeep Gollakota
it to work? Gwen On Mon, Feb 2, 2015 at 1:38 PM, Pradeep Gollakota pradeep...@gmail.com wrote: This is a great question Otis. Like Gwen said, you can accomplish Sync mode by setting the batch size to 1. But this does highlight a shortcoming of the new producer API. I really like the design

Re: Kafka ETL Camus Question

2015-02-02 Thread Pradeep Gollakota
Hi Bhavesh, At Lithium, we don't run Camus in our pipelines yet, though we plan to. But I just wanted to comment regarding speculative execution. We have it disabled at the cluster level and typically don't need it for most of our jobs. Especially with something like Camus, I don't see any need

Re: < and > Comparison Operators not working

2015-02-02 Thread Pradeep Gollakota
with the CSV where the f1 is stored as a string? The CSV data would look like this: 10,abc 20,xyz 30,lmn ... etc. Thanks, Amit On Monday, February 2, 2015 3:37 AM, Pradeep Gollakota pradeep

Re: < and > Comparison Operators not working

2015-02-02 Thread Pradeep Gollakota
Just to clarify, do you have a semicolon after f1 > 20? A = LOAD 'data' USING PigStorage(','); B = FOREACH A GENERATE f1; C = FILTER B BY f1 > 20; DUMP C; This should be correct. On Sun, Feb 1, 2015 at 4:50 PM, Amit am...@yahoo.com.invalid wrote: Hello, I am trying to run an Ad-hoc pig script

Pattern for pig/hive setup files

2015-01-27 Thread Pradeep Gollakota
Hi All, I'm trying to establish a good pattern and practice with Oozie for sharing a setup file for pig/hive. For example, I have several scripts that use a set of UDFs that are built in-house. In order to use the UDF, I need to add the jar file, and then register the UDF. Rather than repeating

Re: solr indexing using pig script

2015-01-16 Thread Pradeep Gollakota
Viswanath On 16-Jan-2015, at 12:34, Pradeep Gollakota pradeep...@gmail.com wrote: Just out of curiosity, why are you using SET to set the solr collection? I'm not sure if you're using an out of the box Load/Store Func, but if I were to design it, I would use the location of a Load/Store

Re: Is ther a way to run one test of special unit test?

2015-01-16 Thread Pradeep Gollakota
If you're using maven AND using surefire plugin 2.7.3+ AND using Junit 4, then you can do this by specifying -Dtest=TestClass#methodName ref: http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html On Thu, Jan 15, 2015 at 8:02 PM, Cheolsoo Park piaozhe...@gmail.com

Re: solr indexing using pig script

2015-01-15 Thread Pradeep Gollakota
Just out of curiosity, why are you using SET to set the solr collection? I'm not sure if you're using an out of the box Load/Store Func, but if I were to design it, I would use the location of a Load/Store Func to specify which solr collection to write to. Is it possible for you to redesign this

Re: Is ther a way to run one test of special unit test?

2015-01-15 Thread Pradeep Gollakota
If you're using maven AND using surefire plugin 2.7.3+ AND using Junit 4, then you can do this by specifying -Dtest=TestClass#methodName ref: http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html On Thu, Jan 15, 2015 at 8:02 PM, Cheolsoo Park piaozhe...@gmail.com

Re: [akka-user] BalancingPool with custom mailbox

2015-01-05 Thread Pradeep Gollakota
is not possible with Akka 2.3.x. It might be supported later, see https://github.com/akka/akka/issues/13961 and https://github.com/akka/akka/issues/13964 Regards, Patrik On Tue, Dec 30, 2014 at 8:58 PM, Pradeep Gollakota prade...@gmail.com wrote: Hi All, I’m trying to create

[akka-user] BalancingPool with custom mailbox

2014-12-30 Thread Pradeep Gollakota
Hi All, I’m trying to create an ActorSystem where a set of actors have a shared mailbox that’s prioritized. I’ve tested my mailbox without using the BalancingPool router, and the messages are correctly prioritized. However, when I try to create the actors using BalancingPool, the messages
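
The prioritized-mailbox half of this is the straightforward part; a sketch in Java against the classic Akka mailbox API (the UrgentCommand marker type and the config id are made up for illustration):

```java
import akka.actor.ActorSystem;
import akka.dispatch.PriorityGenerator;
import akka.dispatch.UnboundedPriorityMailbox;
import com.typesafe.config.Config;

// Registered in application.conf under a mailbox id, e.g.:
//   prio-mailbox { mailbox-type = "example.PriorityMailbox" }
public class PriorityMailbox extends UnboundedPriorityMailbox {
    public PriorityMailbox(ActorSystem.Settings settings, Config config) {
        super(new PriorityGenerator() {
            @Override
            public int gen(Object message) {
                // Lower number = higher priority in the queue.
                if (message instanceof UrgentCommand) return 0;
                return 1;
            }
        });
    }

    /** Hypothetical marker type for high-priority messages. */
    public static final class UrgentCommand {}
}
```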

Re: Max. storage for Kafka and impact

2014-12-19 Thread Pradeep Gollakota
@Joe, Achanta is using Indian English numerals which is why it's a little confusing. http://en.wikipedia.org/wiki/Indian_English#Numbering_system 1,00,000 [1 lakh] (Indian English) == 100,000 [1 hundred thousand] (The rest of the world :P) On Fri Dec 19 2014 at 9:40:29 AM Achanta Vamsi Subhash

Re: Efficient use of buffered writes in a post-HTablePool world?

2014-12-19 Thread Pradeep Gollakota
Hi Aaron, Just out of curiosity, have you considered using asynchbase? https://github.com/OpenTSDB/asynchbase On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk ndimi...@apache.org wrote: Hi Aaron, Your analysis is spot on and I do not believe this is by design. I see the write buffer is owned

Re: Reduce load to hbase

2014-12-07 Thread Pradeep Gollakota
This doesn't answer your question per se, but this is how we dealt with load on HBase at Lithium. We power klout.com with HBase. On a nightly basis, we load user profile data and Klout scores for approx. 600 million users into HBase. We also do maintenance on HBase such as major compactions on a

Re: What companies are using HBase to serve a customer-facing product?

2014-12-06 Thread Pradeep Gollakota
Lithium (Klout) powers www.klout.com with HBase. The operations team is 2 full time engineers + the manager (who also does hands on operations work with the team). This operations team is responsible for the entirety of our Hadoop stack including the HBase clusters. We have one 165 node Hive

Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
Java strings are immutable, so pdfText.concat() returns a new string and the original string is left unmolested. So at the end, all you're doing is returning an empty string. Instead, you can do pdfText = pdfText.concat(...). But the better way to write it is to use a StringBuilder.
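
A sketch of the StringBuilder version as a Pig EvalFunc; the bag-of-chararray input is a simplified stand-in for the PDF-text UDF discussed in this thread:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

/** Concatenates the chararray fields of a bag using a StringBuilder. */
public class ConcatText extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        DataBag bag = (DataBag) input.get(0);
        StringBuilder text = new StringBuilder();
        for (Tuple t : bag) {
            // append() mutates the builder in place, unlike String.concat(),
            // which returns a new String that must be reassigned.
            text.append((String) t.get(0));
        }
        return text.toString();
    }
}
```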

Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
it. - Pradeep On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota pradeep...@gmail.com wrote: Java string's are immutable. So pdfText.concat() returns a new string and the original string is left unmolested. So at the end, all you're doing is returning an empty string. Instead, you can do pdfText

Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
how to best do things within the Pig/MapReduce/Hadoop framework Ryan On Fri, Dec 5, 2014 at 1:35 PM, Ryan freelanceflashga...@gmail.com wrote: Thanks Pradeep! I'll give it a try and report back Ryan On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota pradeep...@gmail.com wrote

Re: Using Pig To Scan Hbase

2014-12-05 Thread Pradeep Gollakota
There is a built in storage handler for HBase. Take a look at the docs at https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html It doesn't support dealing with salted rowkeys (or reverse timestamps) out of the box, so you may have to munge the data a

Re: Custom FileInputFormat.class

2014-12-01 Thread Pradeep Gollakota
Can you expand on your use case a little bit please? It may be that you're duplicating functionality. You can take a look at the CombineFileInputFormat for inspiration. If this is indeed taking a long time, one cheap to implement thing you can do is to parallelize the calls to get block

Re: [protobuf] Parse a .proto file

2014-11-05 Thread Pradeep Gollakota
(fieldDescriptor.toProto()); } } } On Saturday, November 1, 2014 4:26:35 AM UTC-7, Oliver wrote: On 1 November 2014 02:24, Pradeep Gollakota prade...@gmail.com wrote: Confirmed... When I replaced the md variable with the compiled Descriptor

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
, October 30, 2014 2:41:19 PM UTC-7, Oliver wrote: On 30 October 2014 02:53, Pradeep Gollakota prade...@gmail.com wrote: I have a use case where I need to parse messages without having the corresponding precompiled classes in Java. So the DynamicMessage seems

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
, then I'd go with using the parsed descriptors as your format description, not the text .proto file. Oliver On 31 October 2014 17:56, Pradeep Gollakota prade...@gmail.com wrote: Hi Oliver, Thanks for the response! I guess my question wasn't quite clear. In my java

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
on this? Thanks again all! On Friday, October 31, 2014 2:18:44 PM UTC-7, Ilia Mirkin wrote: At no point are you specifying that you want to use the MessagePublish descriptor, so you must still be using the API incorrectly... On Fri, Oct 31, 2014 at 5:10 PM, Pradeep Gollakota prade...@gmail.com

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
been annotated with the (isPii = true) option. On Friday, October 31, 2014 3:25:51 PM UTC-7, Ilia Mirkin wrote: On Fri, Oct 31, 2014 at 6:18 PM, Pradeep Gollakota prade...@gmail.com wrote: Boolean extension = fieldDescriptor.getOptions().getExtension(Messages.isPii

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
try to mix pregenerated code and dynamically loaded descriptors, all sorts of things break. Oliver On 1 November 2014 00:48, Pradeep Gollakota pradeep...@gmail.com wrote: Not really... one of the use cases I'm trying to solve for is an anonymization use case. We will have several apps writing

[protobuf] Parse a .proto file

2014-10-30 Thread Pradeep Gollakota
Hi Protobuf gurus, I'm trying to parse a .proto file in Java to use with DynamicMessages. Is this possible or does it have to be compiled to a descriptor set file first before this can be done? I have a use case where I need to parse messages without having the corresponding precompiled
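
A sketch of the descriptor-set route the thread converges on, in Java. The MessagePublish type name comes from the discussion; the descriptor file name and the byte-reading helper are stand-ins, and a .proto with imports would need its dependencies passed instead of an empty array:

```java
import com.google.protobuf.DescriptorProtos.FileDescriptorSet;
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FileDescriptor;
import com.google.protobuf.DynamicMessage;
import java.io.FileInputStream;

public class DynamicParse {
    public static void main(String[] args) throws Exception {
        // messages.desc produced by: protoc --descriptor_set_out=messages.desc messages.proto
        FileDescriptorSet set;
        try (FileInputStream in = new FileInputStream("messages.desc")) {
            set = FileDescriptorSet.parseFrom(in);
        }
        FileDescriptor fd =
            FileDescriptor.buildFrom(set.getFile(0), new FileDescriptor[] {});
        Descriptor type = fd.findMessageTypeByName("MessagePublish");

        byte[] payload = readMessageBytesSomehow(); // stand-in for the wire bytes
        DynamicMessage msg = DynamicMessage.parseFrom(type, payload);
        System.out.println(msg);
    }

    private static byte[] readMessageBytesSomehow() { return new byte[0]; }
}
```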

Re: Upgrading a coprocessor

2014-10-29 Thread Pradeep Gollakota
At Lithium, we power Klout using HBase. We load Klout scores for about 500 million users into HBase every night. When a load is happening, we noticed that the performance of klout.com was severely degraded. We also see severely degraded performance when performing operations like compactions. In

Re: Is there a way to indicate that the data is sorted for a group-by operation?

2014-10-13 Thread Pradeep Gollakota
This is a great question! I could be wrong, but I don't believe there is a way to indicate this for a group-by. It definitely does matter for performance if your input is globally sorted. Currently a group by happens on reduce side. But if the input is globally sorted, this can happen map side

Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
Hi Ankur, Is the list of regular expressions static or dynamic? If it's a static list, you can collapse all the filter operators into a single operator and use the AND keyword to combine them. E.g. Filtered_Data = FILTER BagName BY ($0 matches 'RegEx-1') AND ($0 matches 'RegEx-2') AND ($0

Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
In case you haven't seen this already, take a look at http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on optimizing your pig scripts. On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney russell.jur...@gmail.com wrote: Actually, I don't think you need SelectFieldByValue. Just

Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
suggestions. Sorry for not being clear with specification at first place. Thanks. On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota pradeep...@gmail.com wrote: In case you haven't seen this already, take a look at http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies

Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread Pradeep Gollakota
I agree with the answers suggested above. 3. B 4. D 5. C On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote: Hi No, Pig is a data manipulation language for data already in Hadoop. The question is about importing data from OLTP DB (eg Oracle, MySQL...) to Hadoop, this is what Sqoop

Re: datanode down, disk replaced , /etc/fstab changed. Can't bring it back up. Missing lock file?

2014-10-03 Thread Pradeep Gollakota
Looks like you're facing the same problem as this SO post: http://stackoverflow.com/questions/10705140/hadoop-datanode-fails-to-start-throwing-org-apache-hadoop-hdfs-server-common-sto Try the suggested fix. On Fri, Oct 3, 2014 at 6:57 PM, Colin Kincaid Williams disc...@uw.edu wrote: We had a

Re: Block placement without rack aware

2014-10-02 Thread Pradeep Gollakota
It appears to be randomly chosen. I just came across this blog post from Lars George about HBase file locality in HDFS http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote: What is the block placement policy hadoop

Re: Hbase Read/Write poor performance. Please help!

2014-08-13 Thread Pradeep Gollakota
Can you post the client code you're using to read/write from HBase? On Wed, Aug 13, 2014 at 11:21 AM, kacperolszewski kacperolszew...@o2.pl wrote: Hello there, I'm running a read/write benchmark on huge data (Twitter posts) for my school project. The problem I'm dealing with is that the

Re: Json Loader - Array of objects - Loading results in empty data set

2014-08-08 Thread Pradeep Gollakota
I think there's a problem with your schema. {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})} should probably look like {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2: chararray))})} On Thu, Aug 7, 2014 at 11:22 AM, Klüber, Ralf

Re: Json Loader - Array of objects - Loading results in empty data set

2014-08-08 Thread Pradeep Gollakota
- From: Pradeep Gollakota [mailto:pradeep...@gmail.com] Sent: Friday, August 08, 2014 2:21 PM To: user@pig.apache.org Subject: Re: Json Loader - Array of objects - Loading results in empty data set I think there's a problem with your schema. {DataASet: (A1: int,A2: int,DataBSets

Rolling upgrades

2014-08-01 Thread Pradeep Gollakota
Hi All, Is it possible to do a rolling upgrade from Hadoop 2.2 to 2.4? Thanks, Pradeep

Kafka Go Client

2014-07-22 Thread Pradeep Gollakota
Hi All, I was watching the talks from the Kafka meet up at LinkedIn last month. While answering a question on producers spilling to disk, Neha mentioned that there was a Go client that had this feature. I was wondering if the client that does this is https://github.com/Shopify/sarama/issues. I'm

Re: Query On Pig

2014-07-01 Thread Pradeep Gollakota
i. Equals can be mimicked by specifying both <= and >= (i.e. -lte=123 -gte=123) ii. What do you mean by taking a partial rowkey? The lte and gte are partial matches. On Mon, Jun 30, 2014 at 10:24 PM, Nivetha K nivethak3...@gmail.com wrote: Hi, I am working with Pig. I need to know some

Re: Query On Pig

2014-07-01 Thread Pradeep Gollakota
consider my rowkeys are 123456,123678,123678,124789,124789.. I need to take the rowkeys that start with 123 On 1 July 2014 11:36, Pradeep Gollakota pradeep...@gmail.com wrote: i. Equals can be mimicked by specifying both <= and >= (i.e. -lte=123 -gte=123) ii. What do you mean
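
The -gte/-lte bounds work because a prefix match is just a key range: the stop row is the prefix with its last byte incremented. A sketch of the same trick against the plain HBase client (assuming a recent client API and a hypothetical table name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            // "starts with 123" is the half-open range [123, 124).
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("123"))
                .withStopRow(Bytes.toBytes("124"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```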

Re: [DISCUSS] Kafka Security Specific Features

2014-06-06 Thread Pradeep Gollakota
I'm actually not convinced that encryption needs to be handled server side in Kafka. I think the best solution for encryption is to handle it producer/consumer side just like compression. This will offload key management to the users and we'll still be able to leverage the sendfile optimization
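
A minimal sketch of what producer/consumer-side payload encryption could look like, using only the standard javax.crypto APIs; key generation is inlined here for the example, while real key management would live with the application as suggested above:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Encrypt before produce, prepend the IV, decrypt after consume. */
public class PayloadCrypto {
    private static final int IV_LEN = 12, TAG_BITS = 128;

    public static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plaintext);
        byte[] out = new byte[IV_LEN + ct.length]; // wire format: IV || ciphertext
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    public static byte[] decrypt(SecretKey key, byte[] message) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
            new GCMParameterSpec(TAG_BITS, message, 0, IV_LEN));
        return cipher.doFinal(message, IV_LEN, message.length - IV_LEN);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();
        byte[] wire = encrypt(key, "event payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, wire), StandardCharsets.UTF_8));
    }
}
```

One trade-off worth noting: encrypted payloads don't compress, so this approach keeps sendfile on the broker but gives up compression benefit.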

Re: Ambari with Druid

2014-06-05 Thread Pradeep Gollakota
, Pradeep Gollakota pradeep...@gmail.com wrote: Ambari has a concept of custom stacks. So, you can write a custom stack to deploy Druid. At installation time, you can choose to install your Druid stack but not the Hadoop stack. On Wed, Jun 4, 2014 at 9:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com

[jira] [Commented] (AMBARI-5707) Replace Ganglia with high performant and pluggable Metrics System

2014-06-04 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/AMBARI-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017422#comment-14017422 ] Pradeep Gollakota commented on AMBARI-5707: --- I too agree that it may

Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email; On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe drah...@googlemail.com wrote: Hi All, I have imported a hive table into pig having a complex data type (ARRAY<String>). The alias in pig looks as below grunt describe A;

Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
Disregard last email. Sorry... didn't fully understand the question. On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota pradeep...@gmail.com wrote: FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email; ​ On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe drah

Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
There was a similar question as this on StackOverflow a while back. The suggestion was to write a custom BagToTuple UDF. http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota pradeep...@gmail.com wrote

Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
,florida) grunt describe B; B: {org.apache.pig.builtin.bagtotuple_cust_address_34::innerfield: chararray} I am not able to separate the fields in B as $0, $1 and $3, tried using STRSPLIT but didn't work. On Mon, Jun 2, 2014 at 11:50 AM, Pradeep Gollakota pradeep...@gmail.com wrote

Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
BagToTuple(cust_address); grunt describe B; B: {org.apache.pig.builtin.bagtotuple_cust_address_24: (innerfield: chararray)} grunt dump B; ((2200,benjamin franklin,philadelphia)) ((44,atlanta franklin,florida)) On Mon, Jun 2, 2014 at 12:59 PM, Pradeep Gollakota pradeep

Re: Upgrade from Hbase 0.94.6 to 0.96 (From CDH 4.5 - CDH 5.0)

2014-06-02 Thread Pradeep Gollakota
Hortonworks has written a bridge tool to help with this. As far as I know, this will only work for replicating from a 0.94 cluster to a 0.96 cluster. Check out https://github.com/hortonworks/HBaseReplicationBridgeServer On Mon, Jun 2, 2014 at 7:35 AM, yanivG yaniv.yancov...@gmail.com wrote:

Re: How to sample an inner bag?

2014-05-27 Thread Pradeep Gollakota
@Mehmet... great hack! I like it :-P On Tue, May 27, 2014 at 5:08 PM, Mehmet Tepedelenlioglu mehmets...@yahoo.com wrote: If you know how many items you want from each inner bag exactly, you can hack it like this: x = foreach x { y = foreach x generate RANDOM() as rnd, *; y =
